TRANSFERRING AND QUEUEING LENGTH AND DATA AS ONE STREAM



FIELD OF THE INVENTION 



The present invention is related to storing multiple
packets within one memory word in a wide cache buffer structure.
More specifically, the present invention is related to storing
multiple packets within one memory word in a wide cache buffer
structure by appending packet length information to a



BACKGROUND OF THE INVENTION

In the BFS Memory Controller, a wide cache buffer structure is
used where multiple packets are packed within one memory word to
optimize buffer access bandwidth. With this, and because BFS can
switch packets of different lengths, packet boundary information is
lost in the wide cache buffer. If the packet boundary information
(i.e., the packet length calculated by the Aggregators) were to be
sent on a different bus to the Separators (which need it to extract
packets and send them to different Port Cards), then the Memory
Controllers would have to take it in on a bus independent of the
data from the Aggregators, queue it independently of the data, and
send it out to the Separators on a bus independent of the data.
Also, within the Memory Controllers, the data queue link lists and
the length information queue link lists would have to be
synchronized.



SUMMARY OF THE INVENTION 






The present invention pertains to a switch for switching 
packets. Each packet has a length. The switch comprises a port 
card which receives packets from and sends packets to a network.
The switch comprises fabrics connected to the port card which
switch the packets. Each fabric has a memory mechanism. Each
fabric has a mechanism for determining the length of each packet
received by the fabric and placing a length indicator with the
packet so when the packet is stored in the memory mechanism, the
determining mechanism can identify from the length indicator how 
long the packet is and where the packet ends in the memory 
mechanism. 

The present invention pertains to a method for switching 
packets having a length. The method comprises the steps of
receiving a packet at a port card of a switch. Then there is the
step of sending fragments of the packet to fabrics of the switch.
Next there is the step of receiving the fragments of the packet at
the fabrics of the switch. Then there is the step of measuring the
length of the packet at each fabric from the fragments of the
packet received at each fabric. Next there is the step of
appending a length indicator to the packet. Then there is the step
of storing the packet with the length indicator in a memory
mechanism of the fabric. Next there is the step of reading the
packet from the memory mechanism. Then there is the step of
determining where the packet ends from the length indicator of the
packet.

BRIEF DESCRIPTION OF THE DRAWINGS 

In the accompanying drawings, the preferred embodiment of 
the invention and preferred methods of practicing the invention are
illustrated in which:

Figure 1 is a schematic representation of packet striping
in the switch of the present invention.

Figure 2 is a schematic representation of an OC48 port
card.

Figure 3 is a schematic representation of a concatenated
network blade.

Figure 4 is a schematic representation regarding the
connectivity of the fabric ASICs.

Figure 5 is a schematic representation of sync pulse
distribution.

Figure 6 is a schematic representation regarding the
relationship between transmit and receive sequence counters for the
separator and unstriper, respectively.

Figure 7 is a schematic representation of a switch of the
present invention.

Figure 8 is a schematic representation of a packet with 
a length indicator. 

DETAILED DESCRIPTION 

Referring now to the drawings wherein like reference 
numerals refer to similar or identical parts throughout the several
views, and more specifically to figure 7 thereof, there is shown a
switch 10 for switching packets 11. Each packet 11 has a length.
The switch 10 comprises a port card 12 which receives packets 11
from and sends packets 11 to a network 16. The switch 10 comprises
fabrics 14 connected to the port card 12 which switch the packets
11. Each fabric 14 has a memory mechanism 18. Each fabric 14 has
a mechanism for determining the length of each packet 11 received
by the fabric 14 and placing a length indicator 22 with the packet
11 so when the packet 11 is stored in the memory mechanism 18, the
determining mechanism 20 can identify from the length indicator 22
how long the packet 11 is and where the packet 11 ends in the
memory mechanism 18. 

Preferably, the determining mechanism 20 includes an 
aggregator 24 which receives packet 11 fragments 26 from a striper
40 of the port card 12, determines the packet 11 length and appends
packet 11 length information 28 to the beginning of the packet 11
in the length indicator 22, as shown in figure 8. Referring again
to figure 1, the memory mechanism 18 preferably includes a memory
controller 30. The aggregator 24 sends the packet 11 with the
packet 11 length information 28 to the memory controller 30 which
stores the packet 11 with the packet 11 length information 28.
Preferably, the memory controller 30 has a memory 32 which has a
wide cache buffer structure in which multiple packets 11 are put
into one word 34. The striper 40 preferably sends corresponding
fragments 26 of a packet 11 to the aggregator 24 of each of the
fabrics 14 during the same logical time.

The fabric 14 preferably includes a separator 36 which 
reads the packets 11 from the memory controller 30 and extracts the 
packet 11 length information 28 from each packet 11 to determine
when each packet 11 ends, and sends fragments 26 of the packet 11
to the port card 12. Preferably, the separator 36 removes the
packet 11 length information 28 from each packet 11 before sending
any fragments 26 of each packet 11 to an unstriper 38 of the port
card 12.

The present invention pertains to a method for switching 
packets 11 having a length. The method comprises the steps of 
receiving a packet 11 at a port card 12 of a switch 10. Then there
is the step of sending fragments 26 of the packet 11 to fabrics 14
of the switch 10. Next there is the step of receiving the
fragments 26 of the packet 11 at the fabrics 14 of the switch 10.
Then there is the step of measuring the length of the packet 11 at
each fabric 14 from the fragments 26 of the packet 11 received at
each fabric 14. Next there is the step of appending a length
indicator 22 to the packet 11. Then there is the step of storing
the packet 11 with the length indicator 22 in a memory mechanism 18
of the fabric 14. Next there is the step of reading the packet 11
from the memory mechanism 18. Then there is the step of
determining where the packet 11 ends from the length indicator 22
of the packet 11.

Preferably, the receiving step includes the step of 
receiving the fragment 26 at an aggregator 24 of the fabric 14. 
The measuring step preferably includes the step of measuring the 
length of the packet 11 with the aggregator 24. Preferably, the
appending step includes the step of appending the length
indicator 22 to the packet 11 with the aggregator 24. The storing
step preferably includes the step of storing the packet 11 with the
length indicator 22 in a memory controller 30 of the memory
mechanism 18.

Preferably, the reading step includes the step of reading 
the packet 11 from the memory controller 30 with a separator 36 of
the fabric 14. The determining step preferably includes the step
of determining where a packet 11 ends from the length indicator 22
with the separator 36. Preferably, after the determining step
there is the step of removing the packet 11 length information 28
from the packet 11 with the separator 36.

After the removing step, there is preferably the step of
sending fragments 26 of the packets 11 from the separator 36 to the
port card 12. Preferably, the sending fragments 26 step includes
the step of sending fragments 26 of the packet 11 to the port card
12 in a same logical time with corresponding fragments 26 from
other fabrics to the port card 12. The storing step preferably
includes the step of storing the fragments 26 of the packet 11 in
a memory 32 of the memory controller 30 which has a wide cache
buffer structure in which multiple packets 11 are put into one word
34.

Preferably, after the reading step, there is the step of
extracting the packet 11 length information 28 from the packet 11
with a separator 36. The receiving step preferably includes the
step of receiving the fragments 26 of the packet 11 from the
fabrics 14 with an unstriper 38 of the port card 12. Preferably,
the sending fragments 26 to the fabric 14 step includes the step of
sending with a striper 40 of the port card 12 to the aggregator 24
of each fabric 14 the fragments 26 of the packet 11. The step of
sending fragments 26 to the port card 12 preferably includes the
step of sending fragments 26 from the separator 36 to an unstriper
38 of the port card 12.

In the operation of the invention, in the BFS memory
controller 30, a wide cache buffer structure is used to better
utilize the limited amount of buffer access bandwidth to enqueue
and dequeue high amounts of traffic. With this, multiple packets
are packed together into a wide memory word 34, and only one write
and one read is done to the buffer for all the packets that are
part of that word 34. Since multiple packets are put into one
word 34, the information about where one packet 11 ends and another
starts has to be maintained, i.e., the packet length has to be
maintained. This can be done in two ways.

In one approach, the packet 11 length can be maintained
in a link list independent of the packet 11 data. This would
require that another wide cache buffer structure be used, where
each word 34 can hold length information 28 for up to N packets,
where N is the maximum number of packets that could start in one
word 34 of the packet data buffer. With the wide range of packet
sizes supported by BFS (from a 40 byte packet to a 64K byte
packet), N has to be computed based on the smallest size packets
that can be put together into one word 34 of the wide data buffer.
But if, on average, the packets were larger than the smallest size
packet, then most of the memory 32 in the buffer used to store
length information 28 would be wasted. Also, since packet 11 length
and packet 11 data have to be sent at the same time from the Memory
Controllers to the Separators, the link list handling the length
buffer has to be synchronized (kept in lock-step) with the link
list handling the packet 11 data buffer. This approach also
requires a separate bus for transferring the length information 28
independently across multiple physical devices.

In the operation of the invention, by appending packet 11
length information 28 to the beginning of a packet 11 in the
Aggregators, and sending it as one stream through the Memory
Controllers and to the Separators, the Memory Controllers do not
have to handle packet 11 length and packet 11 data separately. This
helps the memory controller 30 design, as it does not have to
maintain separate data and length link lists for the same packet 11
stream. Also, this saves a lot of memory 32 because, with the wide
cache memory approach for data, the length memory would also have
to be similarly implemented. With different packet 11 sizes, on
average, a lot of this length memory could be wasted.

In a preferred approach, packet 11 length information 28
is queued and transmitted along with the packet 11 data. In this
case, packet 11 length information 28 (which is always a fixed
number of bits) is attached to the beginning of each packet 11
data. Then this entity containing both packet 11 length and data
is queued as one. Even in this approach, multiple packets are
packed together in each word 34 of the wide cache buffer. When
packets are read from this buffer and sent to the Separator ASIC,
it extracts the length field for the first packet 11 and, based on
that, decides where the first packet 11 ends in the data stream, as
the length field gives the number of bits of packet 11 length and
data together. The bit after the end of the first packet 11 data
is the start bit of the second packet 11 length field. Then again,
based on this length value, the end of the second packet 11 data
and the start of the third packet 11 length are extracted. This
way, all packets can be extracted from the combined stream.
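
The length-prefixed stream described above can be illustrated with a minimal sketch. It is not the ASIC implementation: the names `pack`, `to_words`, and `extract` are hypothetical, the length field is assumed to be 4 bytes wide, and lengths are counted in bytes rather than bits for simplicity. As in the text, each length value covers the length field and the data together.

```python
LEN_FIELD_BYTES = 4  # assumed fixed width of the length field (hypothetical)


def pack(packets):
    """Aggregator side: prefix each packet with a length covering field + data."""
    stream = bytearray()
    for data in packets:
        total = LEN_FIELD_BYTES + len(data)
        stream += total.to_bytes(LEN_FIELD_BYTES, "big") + data
    return bytes(stream)


def to_words(stream, word_bytes):
    """Memory controller side: cut the stream into wide words, unaware of packet boundaries."""
    return [stream[i:i + word_bytes] for i in range(0, len(stream), word_bytes)]


def extract(stream):
    """Separator side: each length field tells where the next length field starts."""
    packets, pos = [], 0
    while pos < len(stream):
        total = int.from_bytes(stream[pos:pos + LEN_FIELD_BYTES], "big")
        if total < LEN_FIELD_BYTES or pos + total > len(stream):
            raise ValueError("corrupt length field")
        # Strip the length field before handing the packet data onward.
        packets.append(stream[pos + LEN_FIELD_BYTES:pos + total])
        pos += total
    return packets
```

Because the boundary information travels inside the stream, the word-slicing step needs no separate length queue; joining the words and calling `extract` recovers the original packets.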

The switch uses RAID techniques to increase overall 
switch bandwidth while minimizing individual fabric bandwidth. In 
the switch architecture, all data is distributed evenly across all
fabrics so the switch adds bandwidth by adding fabrics and the 
fabric need not increase its bandwidth capacity as the switch 
increases bandwidth capacity. 

Each fabric provides 40G of switching bandwidth and the 
system supports 1, 2, 3, 4, 6, or 12 fabrics, exclusive of the
redundant/spare fabric. In other words, the switch can be a 40G,
80G, 120G, 160G, 240G, or 480G switch depending on how many fabrics 
are installed. 

A portcard provides 10G of port bandwidth. For every 4 
portcards, there needs to be 1 fabric. The switch architecture
does not support arbitrary installations of portcards and fabrics. 

The fabric ASICs support both cells and packets. As a 
whole, the switch takes a "receiver makes right" approach where the
egress path on ATM blades must segment frames to cells and the
egress path on frame blades must perform reassembly of cells into
packets.

There are currently eight switch ASICs that are used in
the switch:

1. Striper - The Striper resides on the portcard and
SCP-IM. It formats the data into a 12 bit data
stream, appends a checkword, splits the data stream
across the N non-spare fabrics in the system,
generates a parity stripe of width equal to the
stripes going to the other fabrics, and sends the
N+1 data streams out to the backplane.

2. Unstriper - The Unstriper is the other portcard
ASIC in the switch architecture. It receives
data stripes from all the fabrics in the system. It
then reconstructs the original data stream using
the checkword and parity stripe to perform error
detection and correction.

3. Aggregator - The Aggregator takes the data streams
and routewords from the Stripers and multiplexes
them into a single input stream to the Memory
Controller.

4. Memory Controller - The Memory Controller
implements the queueing and dequeueing mechanisms
of the switch. This includes the proprietary wide
memory interface to achieve the simultaneous
en-/de-queueing of multiple cells of data per clock
cycle. The dequeueing side of the Memory Controller
runs at 80Gbps compared to 40Gbps on the enqueueing
side in order to make the bulk of the queueing and
shaping of connections occur on the portcards.

5. Separator - The Separator implements the inverse
operation of the Aggregator. The data stream from
the Memory Controller is demultiplexed into
multiple streams of data and forwarded to the
appropriate Unstriper ASIC. Included in the
interface to the Unstriper is a queue and flow
control handshaking.



There are 3 different views one can take of the
connections between the fabrics: physical, logical, and "active."
Physically, the connections between the portcards and the fabrics
are all gigabit speed differential pair serial links. This is
strictly an implementation issue to reduce the number of signals
going over the backplane. The "active" perspective looks at a
single switch configuration, or it may be thought of as a snapshot
of how data is being processed at a given moment. The interface
between the fabric ASIC on the portcards and the fabrics is
effectively 12 bits wide. Those 12 bits are evenly distributed
("striped") across 1, 2, 3, 4, 6, or 12 fabrics based on how the
fabric ASICs are configured. The "active" perspective refers to the
number of bits being processed by each fabric in the current
configuration, which is exactly 12 divided by the number of fabrics.



The logical perspective can be viewed as the union or max 
function of all the possible active configurations. Fabric slot #1 
can, depending on configuration, be processing 12, 6, 4, 3, 2, or 
1 bits of the data from a single Striper and is therefore drawn
with a 12 bit bus. In contrast, fabric slot #3 can only be used to 
process 4, 3, 2, or 1 bits from a single Striper and is therefore 
drawn with a 4 bit bus. 
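
The relationship between the active and logical perspectives can be checked with a short sketch. The function names are hypothetical, and the sketch assumes that a fabric slot is in use only in configurations with at least that many fabrics, since slots are populated in ascending order.

```python
BUS_BITS = 12
CONFIGS = (1, 2, 3, 4, 6, 12)  # supported fabric counts, excluding the spare


def active_bits(num_fabrics):
    """Bits per fabric in one configuration: 12 divided by the fabric count."""
    if num_fabrics not in CONFIGS:
        raise ValueError("unsupported number of fabrics")
    return BUS_BITS // num_fabrics


def logical_bits(slot):
    """Widest stripe a slot can carry: the max over all configurations that use it."""
    widths = [active_bits(n) for n in CONFIGS if n >= slot]
    if not widths:
        raise ValueError("slot is never populated")
    return max(widths)
```

Under these assumptions `logical_bits(1)` is 12 and `logical_bits(3)` is 4, which agrees with the bus widths given in the text for fabric slots #1 and #3.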
Unlike previous switches, the switch does not really have
a concept of a software controllable fabric redundancy mode. The
fabric ASICs implement N+1 redundancy without any intervention as
long as the spare fabric is installed.

As for what it provides, N+1 redundancy means that
the hardware will automatically detect and correct a single failure
without the loss of any data.

The way the redundancy works is fairly simple, but to
make it even simpler to understand, a specific case of a 120G switch
is used which has 3 fabrics (A, B, and C) plus a spare (S). The
Striper takes the 12 bit bus and first generates a checkword which
gets appended to the data unit (cell or frame). The data unit and
checkword are then split into a 4-bit-per-clock-cycle data stripe
for each of the A, B, and C fabrics ($A_3A_2A_1A_0$, $B_3B_2B_1B_0$,
and $C_3C_2C_1C_0$). These stripes are then used to produce the
stripe for the spare fabric $S_3S_2S_1S_0$, where
$S_n = A_n$ XOR $B_n$ XOR $C_n$, and these 4 stripes are
sent to their corresponding fabrics. On the other side of the
fabrics, the Unstriper receives four 4-bit stripes from A, B, C, and
S. All possible combinations of 3 fabrics (ABC, ABS, ASC, and SBC)
are then used to reconstruct a "tentative" 12-bit data stream. A
checkword is then calculated for each of the 4 tentative streams
and the calculated checkword compared to the checkword at the end
of the data unit. If no error occurred in transit, then all 4
streams will have checkword matches and the ABC stream will be
forwarded to the Unstriper output. If a (single) error occurred,
only one checkword match will exist and the stream with the match
will be forwarded off chip and the Unstriper will identify the
faulty fabric stripe.
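
The 120G case above can be walked through in a small sketch. The text does not specify the checkword algorithm, so a 12-bit sum of the data words stands in for it here as a labeled assumption; the function names are likewise hypothetical. The parity stripe and the four tentative reconstructions follow the description.

```python
def checkword(words):
    """Stand-in checkword (assumption): 12-bit sum of the 12-bit data words."""
    return sum(words) & 0xFFF


def stripe(words):
    """Striper: append the checkword, split each 12-bit word into A, B, C, add parity S."""
    lanes = {"A": [], "B": [], "C": [], "S": []}
    for w in list(words) + [checkword(words)]:
        a, b, c = (w >> 8) & 0xF, (w >> 4) & 0xF, w & 0xF
        lanes["A"].append(a)
        lanes["B"].append(b)
        lanes["C"].append(c)
        lanes["S"].append(a ^ b ^ c)  # S_n = A_n XOR B_n XOR C_n
    return lanes


def unstripe(lanes):
    """Unstriper: try ABC, SBC, ASC, ABS; return (data, faulty fabric or None)."""
    matches = {}
    for omitted in ("S", "A", "B", "C"):
        trial = dict(lanes)
        if omitted != "S":
            others = [k for k in "ABCS" if k != omitted]
            # Rebuild the omitted stripe from the XOR of the other three.
            trial[omitted] = [x ^ y ^ z for x, y, z in zip(*(lanes[k] for k in others))]
        unit = [(a << 8) | (b << 4) | c
                for a, b, c in zip(trial["A"], trial["B"], trial["C"])]
        data, check = unit[:-1], unit[-1]
        if checkword(data) == check:
            matches[omitted] = data
    if len(matches) == 4:
        return matches["S"], None  # no error: forward the ABC stream
    if len(matches) == 1:
        faulty, data = next(iter(matches.items()))
        return data, faulty  # the only matching stream omits the faulty stripe
    raise ValueError("uncorrectable: more than one stripe is in error")
```

The stripe left out of the single matching combination is the one reported as faulty, so a corrupted A stripe is recovered from the SBC stream.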



For different switch configurations, i.e. 1, 2, 4, 6, or 
12 fabrics, the algorithm is the same but the stripe width changes. 

If 2 fabrics fail, all data running through the switch 
will almost certainly be corrupted. 

The fabric slots are numbered and must be populated in 
ascending order. Also, the spare fabric is a specific slot so 
populating fabric slots 1, 2, 3, and 4 is different than populating 
fabric slots 1, 2, 3, and the spare. The former is a 160G switch 
without redundancy and the latter is 120G with redundancy. 

Firstly, the ASICs are constructed and the backplane 
connected such that the use of certain portcard slots requires
there to be at least a certain minimum number of fabrics installed, 
not including the spare. This relationship is shown in Table 0. 

In addition, the APS redundancy within the switch is 
limited to specifically paired portcards. Portcards 1 and 2 are 
paired, 3 and 4 are paired, and so on through portcards 47 and 48.
This means that if APS redundancy is required, the paired slots 
must be populated together. 

To give a simple example, take a configuration with 2
portcards and only 1 fabric. If the user does not want to use APS
redundancy, then the 2 portcards can be installed in any two of
portcard slots 1 through 4. If APS redundancy is desired, then the
two portcards must be installed either in slots 1 and 2 or slots 3
and 4.
Portcard Slot    Minimum # of Fabrics
1-4              1
5-8              2
9-12             3
13-16            4
17-24            6
25-48            12

Table 0: Fabric Requirements for Portcard Slot Usage
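
Table 0 can be read as a simple lookup, sketched below. The function names are hypothetical and the code only restates the table; it does not model the spare fabric or APS pairing.

```python
# (highest portcard slot in the range, minimum number of fabrics), from Table 0
TABLE_0 = ((4, 1), (8, 2), (12, 3), (16, 4), (24, 6), (48, 12))


def min_fabrics_for_slot(slot):
    """Minimum number of non-spare fabrics needed to use one portcard slot."""
    if not 1 <= slot <= 48:
        raise ValueError("portcard slots are numbered 1 through 48")
    for highest_slot, fabrics in TABLE_0:
        if slot <= highest_slot:
            return fabrics


def min_fabrics(slots):
    """Minimum number of non-spare fabrics for a set of populated slots."""
    return max(min_fabrics_for_slot(s) for s in slots)
```

For instance, `min_fabrics([12])` evaluates to 3, so a single portcard in slot 12 still calls for 3 fabrics.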

To add capacity, add the new fabric(s), wait for the
switch to recognize the change and reconfigure the system to stripe
across the new number of fabrics. Install the new portcards.

Note that it is not technically necessary to have the 
full 4 portcards per fabric. The switch will work properly with 3 
fabrics installed and a single portcard in slot 12. This isn't cost
efficient but it will work. 

To remove capacity, reverse the adding capacity
procedure.

The switch can be oversubscribed, e.g., by installing 8 portcards
and only one fabric. This should only come about as the result of
improperly upgrading the switch or a system failure of some sort.
In practice, one of two things will occur, depending on how this
situation arises. If the switch is configured as a 40G switch and
the portcards are added before the fabric, then the 5th through 8th
portcards will be dead. If the switch is configured as an 80G
non-redundant switch and the second fabric fails or is removed, then
all data through the switch will be corrupted (assuming the spare
fabric is not installed). And just to be complete, if 8 portcards
were installed in an 80G redundant switch and the second fabric
failed or was removed, then the switch would continue to operate
normally with the spare covering for the failed/removed fabric.

Figure 1 shows packet striping in the switch. 

The chipset supports ATM and POS port cards in both OC48
and OC192c configurations. OC48 port cards interface to the
switching fabrics with four separate OC48 flows. OC192 port cards
logically combine the 4 channels into a 10G stream. The ingress
side of a port card does not perform traffic conversions for
traffic changing between ATM cells and packets. Whichever form of
traffic is received is sent to the switch fabrics. The switch
fabrics will mix packets and cells and then dequeue a mix of
packets and cells to the egress side of a port card.

The egress side of the port is responsible for converting
the traffic to the appropriate format for the output port. This
convention is referred to in the context of the switch as "receiver
makes right". A cell blade is responsible for segmentation of
packets and a frame blade is responsible for reassembly of cells
into packets. To support fabric speed-up, the egress side of the
port card supports a link bandwidth equal to twice the inbound side
of the port card.

The block diagram for a Poseidon-based ATM port card is
shown in Figure 2. Each 2.5G channel consists of 4 ASICs: an
inbound TM ASIC and a striper ASIC at the inbound side, and an
unstriper ASIC and an outbound TM ASIC at the outbound side.

At the inbound side, OC-48c or 4 OC-12c interfaces are
aggregated. Each Vortex sends a 2.5G cell stream into a dedicated
striper ASIC (using the BIB bus, as described below). The striper
converts the supplied routeword into two pieces. A portion of the
routeword is passed to the fabric to determine the output port(s)
for the cell. The entire routeword is also passed on the data
portion of the bus as a routeword for use by the outbound memory
controller. The first routeword is termed the "fabric routeword".
The routeword for the outbound memory controller is the "egress
routeword".

At the outbound side, the unstriper ASIC in each channel
takes traffic from each of the port cards, error checks and corrects
the data and then sends correct packets out on its output bus. The
unstriper uses the data from the spare fabric and the checksum
inserted by the striper to detect and correct data corruption.

Figure 2 shows an OC48 Port Card. 

The OC192 port card supports a single 10G stream to the 
fabric and between a 10G and 20G egress stream. This board also
uses 4 stripers and 4 unstripers, but the 4 chips operate in
parallel on a wider data bus. The data sent to each fabric is 
identical for both OC48 and OC192 ports so data can flow between 
the port types without needing special conversion functions. 






Figure 3 shows a 10G concatenated network blade.

Each 40G switch fabric enqueues up to 40Gbps of cells/frames
and dequeues them at 80Gbps. This 2X speed-up reduces the amount of
traffic buffered at the fabric and lets the outbound ASIC digest
bursts of traffic well above line rate. A switch fabric consists of
three kinds of ASICs: aggregators, memory controllers, and
separators. Nine aggregator ASICs receive 40Gbps of traffic from up
to 48 network blades and the control port. The aggregator ASICs
combine the fabric route word and payload into a single data stream,
TDM between their sources, and place the resulting data on a wide
output bus. An additional control bus (destid) is used to control
how the memory controllers enqueue the data. The data stream from
each aggregator ASIC is then bit sliced into 12 memory controllers.

The memory controller receives up to 16 cells/frames 
every clock cycle. Each of 12 ASICs stores 1/12 of the aggregated 
data streams. It then stores the incoming data based on control
information received on the destid bus. Storage of data is
simplified in the memory controller to be relatively unaware of
packet boundaries (cache line concept). All 12 ASICs dequeue the
stored cells simultaneously at aggregated speed of 80Gbps. 

Nine separator ASICs perform the reverse function of the
aggregator ASICs. Each separator receives data from all 12 memory
controllers and decodes the routewords embedded in the data streams
by the aggregator to find packet boundaries. Each separator ASIC
then sends the data to up to 24 different unstripers depending on
the exact destination indicated by the memory controller as data
was being passed to the separator.

The dequeue process is back-pressure driven. If
back-pressure is applied to the unstriper, that back-pressure is
communicated back to the separator. The separator and memory
controllers also have a back-pressure mechanism which controls when
a memory controller can dequeue traffic to an output port.

In order to support OC48 and OC192 efficiently in the
chipset, the 4 OC48 ports from one port card are always routed to
the same aggregator and from the same separator (the port
connections for the aggregator and separator are always symmetric).
The table below shows the port connections for the aggregator and
separator on each fabric for the switch configurations. Since each
aggregator is accepting traffic from 10G of ports, the addition of
40G of switch capacity only adds ports to 4 aggregators. This leads
to a differing port connection pattern for the first four
aggregators from the second 4 (and also the corresponding
separators).



TABLE 2: Agg/Sep port connections

Switch Size  Agg 1        Agg 2        Agg 3        Agg 4        Agg 5        Agg 6        Agg 7        Agg 8
40           1,2,3,4      5,6,7,8      9,10,11,12   13,14,15,16
80           1,2,3,4      5,6,7,8      9,10,11,12   13,14,15,16  17,18,19,20  21,22,23,24  25,26,27,28  29,30,31,32
120          1,2,3,4,     5,6,7,8,     9,10,11,12,  13,14,15,16, 17,18,19,20  21,22,23,24  25,26,27,28  29,30,31,32
             33,34,35,36  37,38,39,40  41,42,43,44  45,46,47,48
160          1,2,3,4,     5,6,7,8,     9,10,11,12,  13,14,15,16, 17,18,19,20, 21,22,23,24, 25,26,27,28, 29,30,31,32,
             33,34,35,36  37,38,39,40  41,42,43,44  45,46,47,48  49,50,51,52  53,54,55,56  57,58,59,60  61,62,63,64
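
One way to read Table 2 is that ports are assigned to Agg 1 through Agg 8 in groups of four, wrapping back to Agg 1 after port 32. The sketch below encodes that reading; the function name and the assumption of 16 ports per 40G of switch capacity are illustrative and are inferred from the table rather than stated in the text.

```python
PORTS_PER_GROUP = 4   # the 4 OC48 ports of one port card share an aggregator
NUM_AGGREGATORS = 8   # Agg 1 through Agg 8 in Table 2


def aggregator_for_port(port, switch_size_g):
    """Aggregator (1-8) serving a port, under the wrap-around reading of Table 2."""
    if switch_size_g not in (40, 80, 120, 160):
        raise ValueError("Table 2 lists only 40G, 80G, 120G and 160G")
    max_port = switch_size_g // 40 * 16  # assumed: 16 ports per 40G of capacity
    if not 1 <= port <= max_port:
        raise ValueError("port not present in this switch size")
    return (port - 1) // PORTS_PER_GROUP % NUM_AGGREGATORS + 1
```

Under this reading, growing from 80G to 120G adds ports 33 through 48 to the first four aggregators only, which matches the differing pattern described in the text.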



Figure 4 shows the connectivity of the fabric ASICs. 

The external interfaces of the switches are the Input Bus
(BIB) between the striper ASIC and the ingress blade ASIC such as
Vortex and the Output Bus (BOB) between the unstriper ASIC and the
egress blade ASIC such as Trident.

The unstriper ASIC sends data to the egress port via the
Output Bus (BOB) (also known as DOUT_UN_bl_ch bus), which is a 64
(or 256) bit data bus that can support either cell or packet.

The Synchronizer has two main purposes. The first
purpose is to maintain logical cell/packet or datagram ordering
across all fabrics. On the fabric ingress interface, datagrams
arriving at more than one fabric from one port card's channels
need to be processed in the same order across all fabrics. The
Synchronizer's second purpose is to have a port card's egress
channel re-assemble all segments or stripes of a datagram that
belong together even though the datagram segments are being sent
from more than one fabric and can arrive at the blade's egress
inputs at different times. This mechanism needs to be maintained in
a system that will have different net delays and varying amounts of
clock drift between blades and fabrics.



The switch uses a system of synchronized windows where
start information is transmitted around the system. Each transmitter
and receiver can look at relative clock counts from the last
resynch indication to synchronize data from multiple sources. The
receiver will delay the receipt of data which is the first clock
cycle of data in a synch period until a programmable delay after it
receives the global synch indication. At this point, all data is
considered to have been received simultaneously and fixed ordering
is applied. Even though the delays for packet 0 and cell 0 caused
them to be seen at the receivers in different orders due to delays
through the box, the resulting ordering of both streams at receive
time = 1 is the same, Packet 0, Cell 0, based on the physical bus
from which they were received.

Multiple cells or packets can be sent in one counter 
tick. All destinations will order all cells from the first 
interface before moving onto the second interface and so on. This
cell synchronization technique is used on all cell interfaces. 
Differing resolutions are required on some interfaces. 
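
The fixed ordering rule can be sketched as a sort on the counter tick and then on the physical interface. This is only an illustration: the tuple layout and the function name are hypothetical, and the actual ASICs work on hardware lanes rather than lists.

```python
def receive_order(arrivals):
    """Order datagrams by counter tick, then by physical interface index.

    arrivals: list of (counter_tick, interface_index, datagram) tuples in the
    order the receiver happened to see them. Python's sort is stable, so
    datagrams on the same interface in the same tick keep their arrival order.
    """
    ordered = sorted(arrivals, key=lambda a: (a[0], a[1]))
    return [datagram for _, _, datagram in ordered]
```

Two receivers that see the same datagrams in different arrival orders end up with the same sequence, for example `receive_order([(1, 1, "Cell 0"), (1, 0, "Packet 0")])` and `receive_order([(1, 0, "Packet 0"), (1, 1, "Cell 0")])` both give Packet 0 followed by Cell 0.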

The Synchronizer consists of two main blocks, namely, the
transmitter and receiver. The transmitter block will reside in the
Striper and Separator ASICs and the receiver block will reside in
the Aggregator and Unstriper ASICs. The receiver in the Aggregator
will handle up to 24 (6 port cards x 4 channels) input lanes. The
receiver in the Unstriper will handle up to 13 (12 fabrics + 1
parity fabric) input lanes.

When a sync pulse is received, the transmitter first
calculates the number of clock cycles it is fast (denoted as N
clocks).

The transmit synchronizer will interrupt the output 
stream and transmit N K characters indicating it is locking down. 
At the end of the lockdown sequence, the transmitter transmits a K
character indicating that valid data will start on the next clock 
cycle. This next cycle valid indication is used by the receivers 
to synchronize traffic from all sources. 



At the next end of transfer, the transmitter will then
insert at least one idle on the interface. These idles allow the
10 bit decoders to correctly resynchronize to the 10 bit serial
code window if they fall out of synch.

The receive synchronizer receives the global synch pulse
and delays the synch pulse by a programmed number (which is
programmed based on the maximum amount of transport delay a
physical box can have). After delaying the synch pulse, the
receiver will then consider the clock cycle immediately after the
synch character to be eligible to be received. Data is then
received every clock cycle until the next synch character is seen
on the input stream. This data is not considered to be eligible
for receipt until the delayed global synch pulse is seen.

Since transmitters and receivers will be on different
physical boards and clocked by different oscillators, clock speed
differences will exist between them. To bound the number of clock
cycles between different transmitters and receivers, a global sync
pulse is used at the system level to resynchronize all sequence
counters. Each chip is programmed to ensure that under all valid
clock skews, each transmitter and receiver will think that it is
fast by at least one clock cycle. Each chip then waits for the
appropriate number of clock cycles they are into their current
sync_pulse_window. This ensures that all sources run
N * sync_pulse_window valid clock cycles between synch pulses.

As an example, the synch pulse window could be programmed
to 100 clocks, and the synch pulses sent out at a nominal rate of
a synch pulse every 10,000 clocks. Based on worst case drifts
for both the synch pulse transmitter clocks and the synch pulse
receiver clocks, there may actually be 9,995 to 10,005 clocks at
the receiver for 10,000 clocks on the synch pulse transmitter. In
this case, the synch pulse transmitter would be programmed to send
out synch pulses every 10,006 clock cycles. The 10,006 clocks
guarantees that all receivers must be in their next window. A
receiver with a fast clock may have actually seen 10,012 clocks if
the synch pulse transmitter has a slow clock. Since the synch
pulse was received 12 clock cycles into the synch pulse window, the
chip would delay for 12 clock cycles. Another receiver could see
10,006 clocks and lock down for 6 clock cycles at the end of the
synch pulse window. In both cases, each source ran 10,100 clock
cycles.
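
The lockdown counts in the example above can be reproduced with a
short sketch. It is illustrative only: the helper name
lockdown_cycles and the modulo formulation are assumptions made
here, reading the text as saying that a chip locks down for as many
clocks as it is into its current synch pulse window when the synch
pulse arrives.

```python
def lockdown_cycles(clocks_seen: int, sync_pulse_window: int) -> int:
    """Clocks a chip is into its current window when the sync pulse arrives.

    Hypothetical helper: the text states only that a chip which sees the
    pulse 12 clocks into its window delays for 12 clocks.
    """
    if sync_pulse_window <= 0:
        raise ValueError("sync_pulse_window must be positive")
    return clocks_seen % sync_pulse_window


WINDOW = 100  # programmed synch pulse window in clocks, from the example

fast_receiver = lockdown_cycles(10_012, WINDOW)   # receiver with a fast clock
other_receiver = lockdown_cycles(10_006, WINDOW)  # receiver that saw 10,006
print(fast_receiver, other_receiver)
```

Both receivers see the pulse inside the window that follows the
100th full window, which is why the transmitter period is chosen
slightly longer than the nominal 10,000 clocks.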

When a port card or fabric is not present or has just
been inserted and either of them is supposed to be driving the
inputs of a receive synchronizer, the writing of data to the
particular input FIFO will be inhibited since the input clock will
be absent or unstable and the status of the data lines will be
unknown. When the port card or fabric is inserted, software must
enable the input to the byte lane to allow data from that source
to be accepted. Writes to the input FIFO will then be
enabled. It is assumed that the enable signal will be asserted
after the data, routeword and clock from the port card or fabric
are stable.

At a system level, there will be a primary and secondary
sync pulse transmitter residing on two separate fabrics. There
will also be a sync pulse receiver on each fabric and blade. This
can be seen in Figure 5. The primary sync pulse transmitter will be
a free-running sync pulse generator and the secondary sync pulse
transmitter will synchronize its sync pulse to the primary. The
sync pulse receivers will receive both primary and secondary sync
pulses and, based on an error checking algorithm, will select the
correct sync pulse to forward on to the ASICs residing on that
board. The sync pulse receiver will guarantee that a sync pulse is
only forwarded to the rest of the board if the sync pulse from the
sync pulse transmitters falls within its own sequence "0" count.
For example, the sync pulse receiver and an Unstriper ASIC will
both reside on the same Blade. The sync pulse receiver and the
receive synchronizer in the Unstriper will be clocked from the same
crystal oscillator, so no clock drift should be present between the
clocks used to increment the internal sequence counters. The
receive synchronizer will require that the sync pulse it receives
will always reside in the "0" count window.

If the sync pulse receiver determines that the primary
sync pulse transmitter is out of sync, it will switch over to the
secondary sync pulse transmitter source. The secondary sync pulse
transmitter will also determine that the primary sync pulse
transmitter is out of sync and will start generating its own sync
pulse independently of the primary sync pulse transmitter. This is
the secondary sync pulse transmitter's primary mode of operation.
If the sync pulse receiver determines that the primary sync pulse
transmitter is in sync once again, it will switch to the
primary side. The secondary sync pulse transmitter will also
determine that the primary sync pulse transmitter is in
sync once again and will switch back to a secondary mode. In the
secondary mode, it will sync up its own sync pulse to the primary
sync pulse. The sync pulse receiver will have less tolerance in
its sync pulse filtering mechanism than the secondary sync pulse
transmitter. The sync pulse receiver will switch over more quickly
than the secondary sync pulse transmitter. This is done to ensure
that all receiver synchronizers will have switched over to using
the secondary sync pulse transmitter source before the secondary
sync pulse transmitter switches over to a primary mode.
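
The ordering requirement described above (receivers abandon the
primary before the secondary transmitter does) can be illustrated
with a small sketch. Everything in it is an assumption introduced
for illustration: the class SyncSourceSelector, the notion of
tolerance as a count of consecutive bad primary pulses, and the
immediate switch back on one good pulse are not taken from the
text, which does not describe the filtering mechanism.

```python
class SyncSourceSelector:
    """Tracks whether the primary sync pulse source is still trusted.

    `tolerance` is a hypothetical filter depth: the number of consecutive
    bad primary pulses tolerated before the primary is declared out of sync.
    """

    def __init__(self, tolerance: int) -> None:
        self.tolerance = tolerance
        self.bad_in_a_row = 0
        self.using_primary = True

    def observe(self, primary_pulse_ok: bool) -> bool:
        """Feed one sync period; return True while the primary is in use."""
        if primary_pulse_ok:
            # Simplification: a single good pulse restores the primary.
            self.bad_in_a_row = 0
            self.using_primary = True
        else:
            self.bad_in_a_row += 1
            if self.bad_in_a_row > self.tolerance:
                self.using_primary = False
        return self.using_primary


def periods_until_switch(selector: SyncSourceSelector) -> int:
    """Number of bad primary periods seen when the selector switches away."""
    periods = 1
    while selector.observe(False):
        periods += 1
    return periods


receiver = SyncSourceSelector(tolerance=1)               # less tolerance
secondary_transmitter = SyncSourceSelector(tolerance=3)  # more tolerance
print(periods_until_switch(receiver),
      periods_until_switch(secondary_transmitter))
```

With the smaller tolerance the receiver model switches first, which
is the property the text requires of the real filtering mechanism.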

Figure 5 shows sync pulse distribution.

In order to lock down the backplane transmission from a
fabric by the number of clock cycles indicated in the sync
calculation, the entire fabric must effectively freeze for that many
clock cycles to ensure that the same enqueuing and dequeueing
decisions stay in sync. This requires support in each of the
fabric ASICs. Lockdown stops all functionality, including special
functions like queue resynch.

The sync signal from the synch pulse receiver is
distributed to all ASICs. Each fabric ASIC contains a counter in
the core clock domain that counts clock cycles between global sync
pulses. After the sync pulse is received, each ASIC calculates the
number of clock cycles it is fast (8). Because the global sync is
not transferred with its own clock, the calculated lockdown cycle
value may not be the same for all ASICs on the same fabric. This
difference is accounted for by keeping all interface FIFOs at a
depth where they can tolerate the maximum skew of lockdown counts.

Lockdown cycles on all chips are always inserted at the
same logical point relative to the beginning of the last sequence
of "useful" (non-lockdown) cycles. That is, every chip will always
execute the same number of "useful" cycles between lockdown events,
even though the number of lockdown cycles varies.

Lockdown may occur at different times on different chips.
All fabric input FIFOs are initially set up such that lockdown can
occur on either side of the FIFO first without the FIFO running dry
or overflowing. On each chip-chip interface, there is a sync FIFO
to account for lockdown cycles (as well as board trace lengths and
clock skews). The transmitter signals lockdown while it is locked
down. The receiver does not push during indicated cycles, and does
not pop during its own lockdown. The FIFO depth will vary
depending on which chip locks down first, but the variation is
bounded by the maximum number of lockdown cycles. The number of
lockdown cycles a particular chip sees during one global sync
period may vary, but they will all have the same number of useful
cycles. The total number of lockdown cycles each chip on a
particular fabric sees will be the same, within a bounded
tolerance.
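
The bounded depth variation of the sync FIFO can be checked with a
toy cycle-by-cycle model. This is a sketch under stated
assumptions: the function fifo_depth_trace, the initial depth of 4
and the specific lockdown cycle numbers are hypothetical, and one
word is assumed to be pushed and popped per non-lockdown cycle.

```python
def fifo_depth_trace(initial_depth, tx_lockdown, rx_lockdown, cycles):
    """Return the sync FIFO depth after each cycle.

    tx_lockdown / rx_lockdown are sets of cycle numbers in which the
    transmitter or the receiver core is locked down (hypothetical inputs).
    """
    depth = initial_depth
    trace = []
    for t in range(cycles):
        if t not in tx_lockdown:
            depth += 1          # real word arrives and is pushed
        if t not in rx_lockdown:
            depth -= 1          # receiver core is running and pops
        if depth < 0:
            raise RuntimeError("FIFO ran dry")
        trace.append(depth)
    return trace


# Transmitter locks down first: the FIFO drains, then refills.
tx_first = fifo_depth_trace(4, {10, 11, 12}, {20, 21, 22}, 30)
# Receiver locks down first: the FIFO builds up, then drains.
rx_first = fifo_depth_trace(4, {20, 21, 22}, {10, 11, 12}, 30)
print(min(tx_first), max(tx_first), min(rx_first), max(rx_first))
```

In both orderings the depth moves by at most the three lockdown
cycles and returns to its starting value, provided the initial
depth is at least the maximum lockdown count.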

The Aggregator core clock domain completely stops for the
lockdown duration - all flops and memory hold their state. Input
FIFOs are allowed to build up. Lockdown bus cycles are inserted in
the output queues. Exactly when the core lockdown is executed is
dictated by when DOUT_AG bus protocol allows lockdown cycles to be
inserted. DOUT_AG lockdown cycles are indicated on the DestID bus.

The memory controller must lock down all flops for the
appropriate number of cycles. To reduce impact to the silicon area
in the memory controller, a technique called propagated lockdown is
used.

The on-fabric chip-to-chip synchronization is executed at
every sync pulse. While some sync error detecting capability may
exist in some of the ASICs, it is the Unstriper's job to detect
fabric synchronization errors and to remove the offending fabric.
The chip-to-chip synchronization is a cascaded function that is
done before any packet flow is enabled on the fabric. The
synchronization flows from the Aggregator to the Memory Controller,
to the Separator, and back to the Memory Controller. After the
system reset, the Aggregators wait for the first global sync
signal. When received, each Aggregator transmits a local sync
command (value 0x2) on the DestID bus to each Memory Controller.

The Memory Controllers do not push anything into a DIN
input FIFO until the first sync command is seen on that bus. The
sync and every bus cycle following is constantly pushed into the
input FIFO. On the core side of the input FIFOs, no FIFO is popped
until a sync appears in the FIFO from every Aggregator. After two
additional margin cycles, every input FIFO is popped every cycle.
After this point the input FIFO depths remain constant. The depths
are roughly a function of the track delays from each Aggregator.
Immediately after the Memory Controllers begin sampling the
Aggregator input FIFOs, a sync signal (S_SYNC_L) is transmitted to
all Separators on the DOUT and CH_ID busses.

Like the Memory Controllers, the Separators do not push 
into the DIN and CH_ID busses until a sync signal is received on 
that bus. The sync and everything after is constantly pushed into 
the input FIFO. 

On the core side the Separator always waits until at
least one word is present on all input busses, and then pops the
CH_ID and DIN busses simultaneously. This will logically align the
data stripes coming from the Memory Controllers. After the first
combined sync is popped from the input FIFOs, the Separators send
a sync signal on the TOKEN bus to the Memory Controllers.

The striping function assigns bits from incoming data
streams to individual fabrics. Two items were optimized in deriving
the striping assignment:

1. Backplane efficiency should be optimized for OC48
and OC192.

2. Backplane interconnection should not be
significantly altered for OC192 operation.

These were traded off against additional muxing legs for
the striper and unstriper ASICs. Regardless of the optimization,
the switch must have the same data format in the memory controller
for both OC48 and OC192.



Backplane efficiency requires that minimal padding be
added when forming the backplane busses. Given the 12 bit backplane
bus for OC48 and the 48 bit backplane bus for OC192, an optimal
assignment requires the number of unused bits for a transfer
to be equal to (number_of_bytes * 8) / bus_width where "/" is integer
division. For OC48, the bus can have 0, 4 or 8 unutilized bits. For
OC192 the bus can have 0, 8, 16, 24, 32, or 40 unutilized bits.
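
The listed unutilized-bit counts can be reproduced with a short
sketch. It assumes, as an interpretation rather than a statement
from the text, that a transfer of whole bytes is padded up to the
next multiple of the bus width; the helper name unused_bits is
hypothetical.

```python
def unused_bits(number_of_bytes: int, bus_width: int) -> int:
    """Pad bits needed so a transfer ends on a bus-width boundary.

    Assumed interpretation: the payload is number_of_bytes * 8 bits and is
    padded up to the next multiple of bus_width.
    """
    return (-number_of_bytes * 8) % bus_width


oc48_values = sorted({unused_bits(n, 12) for n in range(1, 200)})
oc192_values = sorted({unused_bits(n, 48) for n in range(1, 200)})
print(oc48_values)   # [0, 4, 8]
print(oc192_values)  # [0, 8, 16, 24, 32, 40]
```

Under this reading every byte count yields one of the values named
above, so no padding larger than the minimum is ever needed.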

This means that no bit can shift between 12 bit
boundaries or else OC48 padding will not be optimal for certain
packet lengths.

For OC192c, maximum bandwidth utilization means that each
striper must receive the same number of bits (which implies bit
interleaving into the stripers). When combined with the same
backplane interconnection, this implies that in OC192c, each stripe
must have exactly the correct number of bits come from each striper
which has 1/4 of the bits.

For the purpose of assigning data bits to fabrics, a 48
bit frame is used. Inside the striper is a FIFO which is written 32
bits wide at 80-100 MHz and read 24 bits wide at 125 MHz. Three 32
bit words will yield four 24 bit words. Each pair of 24 bit words
is treated as a 48 bit frame. The assignments between bits and
fabrics depend on the number of fabrics.
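
The width conversion performed by the striper FIFO can be sketched
as follows. The function names and the most-significant-bit-first
ordering are assumptions made for illustration; the text specifies
only the 32 bit write width, the 24 bit read width and the pairing
into 48 bit frames.

```python
def repack_32_to_24(words32):
    """Repack 32-bit words into 24-bit words (assumed MSB-first order)."""
    if len(words32) % 3 != 0:
        raise ValueError("need a multiple of three 32-bit words")
    out = []
    for i in range(0, len(words32), 3):
        a, b, c = words32[i:i + 3]
        block = (a << 64) | (b << 32) | c  # 96 bits = three 32-bit words
        out.extend((block >> shift) & 0xFFFFFF for shift in (72, 48, 24, 0))
    return out


def frames_48(words24):
    """Treat each consecutive pair of 24-bit words as one 48-bit frame."""
    return [(hi << 24) | lo for hi, lo in zip(words24[0::2], words24[1::2])]


words24 = repack_32_to_24([0x00112233, 0x44556677, 0x8899AABB])
print([hex(w) for w in words24])
print([hex(f) for f in frames_48(words24)])
```

Three 32 bit words carry 96 bits, which is exactly four 24 bit
words or two 48 bit frames, so the conversion never leaves a
partial frame behind.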



TABLE 11: Bit striping function

The following tables give the byte lanes which are read
first in the aggregator and written to first in the separator. The
four channels are notated A, B, C, D. The different fabrics have
different read/write order of the channels to allow for all busses
to be fully utilized.

One fabric-40G 



The next table gives the interface read order for the
aggregator.

120G

Interfaces to the gigabit transceivers will utilize the
transceiver bus as a split bus with two separate routeword and data
busses. The routeword bus will be a fixed size (2 bits for OC48
ingress, 4 bits for OC48 egress, 8 bits for OC192 ingress and 16
bits for OC192 egress); the data bus is a variable sized bus. The
transmit order will always have routeword bits at fixed locations.
Every striping configuration has one transceiver that is used to
talk to a destination in all valid configurations. That
transceiver will be used to send both routeword busses and to start
sending the data.

The backplane interface is physically implemented using
interfaces to the backplane transceivers. The bus for both ingress
and egress is viewed as being composed of two halves, each with
routeword data. The two bus halves may have information on
separate packets if the first bus half ends a packet.

For example, an OC48 interface going to the fabrics
locally speaking has 24 data bits and 2 routeword bits. This bus
will be utilized acting as if it has 2 x (12 bit data bus + 1 bit
routeword bus). The two bus halves are referred to as A and B.
Bus A is the first data, followed by bus B. A packet can start on
either bus A or B and end on either bus A or B.

In mapping data bits and routeword bits to transceiver
bits, the bus bits are interleaved. This ensures that all
transceivers should have the same valid/invalid status, even if the
striping amount changes. Routewords should be interpreted with bus
A appearing before bus B.

The bus A/Bus B concept closely corresponds to having
interfaces between chips.

All backplane busses support fragmentation of data. The
protocol used marks the last transfer (via the final segment bit in
the routeword). All transfers which are not final segment need to
utilize the entire bus width, even if that is not an even number of
bytes. Any given packet must be striped to the same number of
fabrics for all transfers of that packet. If the striping amount
is updated in the striper during transmission of a packet, it will
only update the striping at the beginning of the next packet.

Each transmitter on the ASICs will have the following I/O
for each channel:

8 bit data bus, 1 bit clock, 1 bit control.

On the receive side, for each channel the ASIC receives
a receive clock, 8 bit data bus, 3 bit status bus.

The switch optimizes the transceivers by mapping a
transmitter to between 1 and 3 backplane pairs and each receiver
to between 1 and 3 backplane pairs. This allows only enough
transmitters to support traffic needed in a configuration to be
populated on the board while maintaining a complete set of
backplane nets. The motivation for this optimization was to reduce
the number of transceivers needed.

The optimization was done while still requiring that at
any time, two different striping amounts must be supported in the
gigabit transceivers. This allows traffic to be enqueued from a
striper striping data to one fabric and a striper striping data to
two fabrics at the same time.



Depending on the bus configuration, multiple channels may
need to be concatenated together to form one larger bandwidth pipe
(any time there is more than one transceiver in a logical
connection). Although quad gbit transceivers can tie 4 channels
together, this functionality is not used. Instead the receiving
ASIC is responsible for synchronizing between the channels from one
source. This is done in the same context as the generic
synchronization algorithm.

The 8b/10b encoding/decoding in the gigabit transceivers
allows a number of control events to be sent over the channel. The
notation for these control events is K characters and they are
numbered based on the encoded 10 bit value. Several of these K
characters are used in the chipset. The K characters used and
their functions are given in the table below.



TABLE 12: K Character usage

K character  Function         Notes

28.0         Sync indication  Transmitted after lockdown cycles, treated
                              as the prime synchronization event at the
                              receivers

28.1         Lockdown         Transmitted during lockdown cycles on the
                              backplane

28.2         Packet Abort     Transmitted to indicate the card is unable
                              to finish the current packet. Current use
                              is limited to a port card being pulled
                              while transmitting traffic

28.3         Resync window    Transmitted by the striper at the start of
                              a synch window if a resynch will be
                              contained in the current sync window

28.4         BP set           Transmitted by the striper if the bus is
                              currently idle and the value of the bp bit
                              must be set.

28.5         Idle             Indicates idle condition

28.6         BP clr           Transmitted by the striper if the bus is
                              currently idle and the bp bit must be
                              cleared.

The switch has a variable number of data bits supported
to each backplane channel depending on the striping configuration
for a packet. Within a set of transceivers, data is filled in the
following order:

F[fabric][oc192 port number][oc48 port designation
(a,b,c,d)][transceiver_number]

The chipset implements certain functions which are
described here. Most of the functions mentioned here have support
in multiple ASICs, so documenting them on an ASIC by ASIC basis
does not give a clear understanding of the full scope of the
functions required.

The switch chipset is architected to work with packets up
to 64K + 6 bytes long. On the ingress side of the switch, there
are buses which are shared between multiple ports. Most
packets are transmitted without any break from the start of
packet to end of packet. However, this approach can lead to large
delay variations for delay sensitive traffic. To allow delay
sensitive traffic and long traffic to coexist on the same switch
fabric, the concept of long packets is introduced. Basically long
packets allow chunks of data to be sent to the queueing location,
built up at the queueing location on a source basis and then added
into the queue all at once when the end of the long packet is
transferred. The definition of a long packet is based on the
number of bits on each fabric.

If the switch is running in an environment where Ethernet
MTU is maintained throughout the network, long packets will not be
seen in a switch greater than 40G in size.

A wide cache-line shared memory technique is used to 
store cells/packets in the port/priority queues. The shared memory 
stores cells/packets continuously so that there is virtually no 
fragmentation and bandwidth waste in the shared memory. 

Multiple queues exist in the shared memory. They
are per-destination and priority based. All cells/packets which
have the same output priority and blade/channel ID are stored in
the same queue. Cells are always dequeued from the head of the
list and enqueued into the tail of the queue. Each cell/packet
consists of a portion of the egress route word, a packet length,
and variable-length packet data. Cells and packets are stored
continuously, i.e., the memory controller itself does not recognize
the boundaries of cells/packets for the unicast connections. The
packet length is stored for MC packets.
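
The idea of carrying the length in the same stream as the data, so
that packet boundaries survive storage in wide memory words, can be
sketched as follows. This is a simplified model, not the chipset's
format: the 2 byte length field, the 25 byte (200 bit) word, the
omission of the route word portion and the function names
pack_words and separate are all assumptions made for illustration.

```python
LENGTH_FIELD_BYTES = 2  # assumed width of the length indicator
WORD_BYTES = 25         # assumed memory word width (200 bits)


def pack_words(packets):
    """Aggregator side: prefix each packet with its length, then cut the
    continuous stream into fixed-width words, ignoring packet boundaries."""
    stream = b"".join(
        len(p).to_bytes(LENGTH_FIELD_BYTES, "big") + p for p in packets
    )
    valid = len(stream)
    stream += bytes(-valid % WORD_BYTES)  # zero-fill the final word
    words = [stream[i:i + WORD_BYTES]
             for i in range(0, len(stream), WORD_BYTES)]
    return words, valid


def separate(words, valid):
    """Separator side: recover packet boundaries from the length fields."""
    stream = b"".join(words)[:valid]
    packets, pos = [], 0
    while pos < len(stream):
        length = int.from_bytes(stream[pos:pos + LENGTH_FIELD_BYTES], "big")
        pos += LENGTH_FIELD_BYTES
        packets.append(stream[pos:pos + length])
        pos += length
    return packets


sent = [b"abc", b"x" * 40, b"hello"]
words, valid = pack_words(sent)
print(len(words), separate(words, valid) == sent)
```

The storage layer in this sketch never inspects the contents of a
word, mirroring the statement that the memory controller does not
recognize unicast packet boundaries; only the reader of the stream
parses the length fields.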

The multicast port mask memory (64K x 16-bit) is used to
store the destination port mask for the multicast connections, one
entry (or multiple entries) per multicast VC. The port masks of the
head multicast connections indicated by the multicast DestID FIFOs
are stored internally for the scheduling reference. The port mask
memory is retrieved when the port mask of head connection is
cleaned and a new head connection is provided.

APS stands for Automatic Protection Switching, which is
a SONET redundancy standard. To support the APS feature in the switch,
two output ports on two different port cards send roughly the same
traffic. The memory controllers maintain one set of queues for an
APS port and send duplicate data to both output ports.



To support data duplication in the memory controller
ASIC, each one of multiple unicast queues has a programmable APS
bit. If the APS bit is set to one, a packet is dequeued to both
output ports. If the APS bit is set to zero for a port, the
unicast queue operates in the normal mode. If a port is configured
as an APS slave, then it will read from the queues of the APS
master port. For OC48 ports, the APS port is always on the same
OC48 port on the adjacent port card.
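
The effect of the APS bit on dequeueing can be illustrated with a
minimal sketch. The function dequeue_targets and the dictionary
mapping a master port to its slave port are hypothetical; the text
specifies only that a set APS bit causes a packet to be dequeued to
both output ports.

```python
def dequeue_targets(port, aps_bit, aps_slave_of):
    """Output ports that receive a packet dequeued from `port`'s queue.

    aps_slave_of is a hypothetical map from an APS master port to the
    slave port that reads from the master's queues.
    """
    if aps_bit:
        return [port, aps_slave_of[port]]
    return [port]


# Hypothetical pairing: port 3 is the APS master, port 11 its slave.
print(dequeue_targets(3, True, {3: 11}))
print(dequeue_targets(3, False, {3: 11}))
```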



The shared memory queues in the memory controllers among
the fabrics might be out of sync (i.e., same queues among different
memory controller ASICs have different depths) due to clock drifts
or a newly inserted fabric. It is important to bring the fabric
queues to the valid and sync states from any arbitrary states. It
is also desirable not to drop cells for any recovery mechanism.



A resync cell is broadcast to all fabrics (new and
existing) to enter the resync state. Fabrics will attempt to drain
all of the traffic received before the resynch cell before queue
resynch ends, but no traffic received after the resynch cell is
drained until queue resynch ends. A queue resynch ends when one of
two events happens:

1. A timer expires.

2. The amount of new traffic (traffic received after the resynch
cell) exceeds a threshold.



At the end of queue resynch, all memory controllers will
flush any left-over old traffic (traffic received before the queue
resynch cell). The freeing operation is fast enough to guarantee
that all memory controllers can fill all of memory no matter when
the resynch state was entered.
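
The queue resynch behavior described above (drain old traffic, hold
new traffic, end on a timer or a threshold, then flush what is left
of the old traffic) can be modeled with a small sketch. The class
ResynchQueue, its tick-based timer and its field names are
assumptions introduced for illustration and are not taken from the
memory controller design.

```python
from dataclasses import dataclass, field


@dataclass
class ResynchQueue:
    timeout: int        # hypothetical timer length, in ticks
    new_threshold: int  # hypothetical limit on post-resynch traffic
    old: list = field(default_factory=list)  # received before resynch cell
    new: list = field(default_factory=list)  # received after resynch cell
    ticks: int = 0
    in_resynch: bool = True

    def tick(self, arrivals=(), drain_one=True):
        """Advance one tick while the queue is in the resynch state."""
        if not self.in_resynch:
            return
        self.new.extend(arrivals)   # new traffic is held, never drained
        if drain_one and self.old:
            self.old.pop(0)         # keep draining old traffic
        self.ticks += 1
        timer_expired = self.ticks >= self.timeout
        too_much_new = len(self.new) > self.new_threshold
        if timer_expired or too_much_new:
            self.old.clear()        # flush left-over old traffic
            self.in_resynch = False


q = ResynchQueue(timeout=100, new_threshold=2, old=[1, 2, 3, 4, 5])
q.tick(["a"])
q.tick(["b", "c"])  # third new cell exceeds the threshold
print(q.in_resynch, q.old, q.new)
```

Either exit condition leaves every queue holding only traffic
received after the resynch cell, which is what allows queues on
different fabrics to reach the same logical state.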



Queue resynch impacts all 3 fabric ASICs. The
aggregators must ensure that the FIFOs drain identically after a
queue resynch cell. The memory controllers implement the queueing
and dropping. The separators need to handle memory controllers
dropping traffic and resetting the length parsing state machines
when this happens. For details on support of queue resynch in
individual ASICs, refer to the chip ADSs.



For the dequeue side, multicast connections have
32 independent tokens per port, each worth up to 50 bits of data or a
complete packet. The head connection and its port mask of a higher
priority queue are read out from the connection FIFO and the port
mask memory every cycle. A complete packet is isolated from the
200-bit multicast cache line based on the length field of the head
connection. The head packet is sent to all its destination ports.
The 8 queue drainers transmit the packet to the separators when
non-zero multicast tokens are available for the ports.
The next head connection will be processed only when the current
head packet has been sent out to all its ports.

The queue structure can be changed on the fly through the fabric
resync cell, where the number of priority per port field is used to
indicate how many priority queues each port has.



The following words have reasonably specific meanings in
the vocabulary of the switch. Many are mentioned elsewhere, but
this is an attempt to bring them together in one place with
definitions.



TABLE 23:

Word              Meaning

APS               Automatic Protection Switching. A SONET/SDH standard
                  for implementing redundancy on physical links. For
                  the switch, APS is used to also recover from any
                  detected port card failures.

Backplane synch   A generic term referring either to the general
                  process the switch boards use to account for varying
                  transport delays between boards and clock drift or
                  to the logic which implements the TX/RX functionality
                  required for the switch ASICs to account for varying
                  transport delays and clock drifts.

BIB               The switch input bus. The bus which is used to pass
                  data to the striper(s). See also BOB.

Blade             Another term used for a port card. References to
                  blades should have been eliminated from this
                  document, but some may persist.

BOB               The switch output bus. The output bus from the
                  striper which connects to the egress memory
                  controller. See also BIB.

Egress Routeword  This is the routeword which is supplied to the chip
                  after the unstriper. From an internal chipset
                  perspective, the egress routeword is treated as
                  data. See also fabric routeword.

Fabric Routeword  Routeword used by the fabric to determine the output
                  queue. This routeword is not passed outside the
                  unstriper. A significant portion of this routeword
                  is blown away in the fabrics.

Freeze            Having logic maintain its values during lock-down
                  cycles.

Lock-down         Period of time where the fabric effectively stops
                  performing any work to compensate for clock drift.
                  If the backplane synchronization logic determines
                  that a fabric is 8 clock cycles fast, the fabric
                  will lock down for 8 clocks.

Queue Resynch     A queue resynch is a series of steps executed to
                  ensure that the logical state of all fabric queues
                  for all ports is identical at one logical point in
                  time. Queue resynch is not tied to backplane resynch
                  (including lock-down) in any fashion, except that a
                  lock-down can occur during a queue resynch.

SIB               Striped input bus. A largely obsolete term used to
                  describe the output bus from the striper and input
                  bus to the aggregator.

SOB               One of two meanings. The first is striped output
                  bus, which is the output bus of the fabric and the
                  input bus of the agg. See also SIB. The second
                  meaning is a generic term used to describe engineers
                  who left Marconi to form/work for a start-up after
                  starting the switch design.

Sync              Depends heavily on context. Related terms are queue
                  resynch, lock-down, freeze, and backplane sync.

Wacking           The implicit bit steering which occurs in the OC192
                  ingress stage since data is bit interleaved among
                  stripers. This bit steering is reversed by the
                  aggregators.

Although the invention has been described in detail in
the foregoing embodiments for the purpose of illustration, it is to
be understood that such detail is solely for that purpose and that
variations can be made therein by those skilled in the art without
departing from the spirit and scope of the invention except as it
may be described by the following claims.

WHAT IS CLAIMED IS:

1. A switch for switching packets, each packet having a
length, comprising:

a port card which receives packets from and sends packets 
to a network; and 

fabrics connected to the port card which switch the 
packets, each fabric having a memory mechanism, each fabric having 
a mechanism for determining the length of each packet received by 
the fabric and placing a length indicator with the packet so when 
the packet is stored in the memory mechanism, the determining 
mechanism can identify from the length indicator how long the 
packet is and where the packet ends in the memory mechanism. 

2. A switch as described in Claim 1 wherein the 
determining mechanism includes an aggregator which receives packet 
fragments from the port card, determines the packet length and 
appends packet length information to the beginning of the packet in 
the length indicator. 

3. A switch as described in Claim 2 wherein the memory 
mechanism includes a memory controller, the aggregator sending the 
packet with the packet length information to the memory controller 
which stores the packet with the packet length information.

4. A switch as described in Claim 3 wherein the memory
controller has a memory which has a wide cache buffer structure in
which multiple packets are put into one word.

5. A switch as described in Claim 4 wherein the fabric 
includes a separator which reads the packets from the memory 
controller and extracts the packet length information from each 
packet to determine when each packet ends, and sends fragments of 
the packet to the port card. 

6. A switch as described in Claim 5 wherein the 
separator removes the packet length information from each packet 
before sending any fragments of each packet to an unstriper of the 
port card. 

7. A method for switching packets having a length 
comprising the steps of: 

receiving a packet at a port card of a switch; 

sending fragments of the packet to fabrics of the switch; 

receiving the fragments of the packet at the fabrics of 
the switch; 

measuring the length of the packet at each fabric from
the fragments of the packet received at each fabric;

appending a length indicator to the packet;

storing the packet with the length indicator in a memory
mechanism of the fabric;



reading the packet from the memory mechanism; and 

determining where the packet ends from the length 
indicator of the packet. 

8. A method as described in Claim 7 wherein the 
receiving step includes the step of receiving the fragment at an 
aggregator of the fabric. 

9. A method as described in Claim 8 wherein the 
measuring step includes the step of measuring the length of the 
packet with the aggregator. 

10. A method as described in Claim 9 wherein the
appending step includes the step of appending the length
indicator to the packet with the aggregator.

11. A method as described in Claim 10 wherein the 
storing step includes the step of storing the packet with the 
length indicator in a memory controller of the memory mechanism. 

12. A method as described in Claim 11 wherein the 
reading step includes the step of reading the packet from the 
memory controller with a separator of the fabric.

13. A method as described in Claim 12 wherein the
determining step includes the step of determining where a packet
ends from the length indicator with the separator.

14. A method as described in Claim 13 including after
the determining step, there is the step of removing the packet
length information from the separator.

15. A method as described in Claim 14 including after 
the removing step, there is the step of sending fragments of the 
packets from the separator to the port card. 

16. A method as described in Claim 15 wherein the 
sending fragments step includes the step of sending fragments of 
the packet to the port card in a same logical time with 
corresponding fragments from other fabrics to the port card. 

17. A method as described in Claim 16 wherein the 
storing step includes the step of storing the fragments of the 
packet in a memory of the memory controller which has a wide cache 
buffer structure in which multiple packets are put into one word. 

18. A method as described in Claim 17 including after 
the reading step, there is the step of extracting the packet length 
information from the packet with a separator. 

19. A method as described in Claim 18 wherein the 
receiving step includes the step of receiving the fragments of the 
packet from the fabrics with an unstriper of the port card. 

20. A method as described in Claim 19 wherein the
sending fragments to the fabric step includes the step of sending 
with a striper of the port card to the aggregator of each fabric 
the fragments of the packet. 

21. A method as described in Claim 20 wherein the step 
of sending fragments to the port card includes the step of sending 
fragments from the separator to an unstriper of the port card. 

ABSTRACT OF THE DISCLOSURE

TRANSFERRING AND QUEUEING LENGTH AND DATA AS ONE STREAM 

A switch for switching packets. Each packet has a 
length. The switch includes a port card which receives packets
from and sends packets to a network. The switch includes fabrics 
connected to the port card which switch the packets. Each fabric 
has a memory mechanism. Each fabric has a mechanism for 
determining the length of each packet received by the fabric and 
placing a length indicator with the packet so when the packet is 
stored in the memory mechanism, the determining mechanism can 
identify from the length indicator how long the packet is and where 
the packet ends in the memory. A method for switching packets 
having a length. The method includes the steps of receiving a 
packet at a port card of a switch. Then there is the step of 
sending fragments of the packet to fabrics of the switch. Next 
there is the step of receiving the fragments of the packet at the 
fabrics of the switch. Then there is the step of measuring the 
length of the packet at each fabric from the fragments of the 
packet received at each fabric. Next there is the step of 
appending a length indicator to the packet. Then there is the step 
of storing the packet with the length indicator in a memory 
mechanism of the fabric. Next there is the step of reading the 
packet from the memory mechanism. Then there is the step of 
determining where the packet ends from the length indicator of the 
packet.

TRANSFERRING AND QUEUEING LENGTH AND DATA AS ONE STREAM



FIELD OF THE INVENTION 



The present invention is related to storing multiple 
packets within one memory word in a wide cache buffer structure.
More specifically, the present invention is related to storing
multiple packets within one memory word in a wide cache buffer
structure by appending packet length information to a

BACKGROUND OF THE INVENTION 

In the BFS Memory Controller, a wide cache buffer structure is
used in which multiple packets are packed within one memory word to
optimize buffer access bandwidth. Because of this packing, and
because BFS can switch packets of different lengths, packet boundary
information is lost in the wide cache buffer. If the packet boundary
information (i.e. the packet length calculated by the Aggregators)
were sent on a separate bus to the Separators (which need it to
extract packets and send them to different Port Cards), then the
Memory Controllers would have to accept it on a bus independent of
the data from the Aggregators, queue it independently of the data,
and send it out to the Separators on a bus independent of the data.
In addition, within the Memory Controllers, the data queue link
lists and the length information queue link lists would have to be
synchronized.

SUMMARY OF THE INVENTION 



The present invention pertains to a switch for switching
packets. Each packet has a length. The switch comprises a port
card which receives packets from and sends packets to a network.
The switch comprises fabrics connected to the port card which
switch the packets. Each fabric has a memory mechanism. Each
fabric has a mechanism for determining the length of each packet
received by the fabric and placing a length indicator with the
packet so when the packet is stored in the memory mechanism, the
determining mechanism can identify from the length indicator how
long the packet is and where the packet ends in the memory
mechanism.

The present invention pertains to a method for switching
packets having a length. The method comprises the steps of
receiving a packet at a port card of a switch. Then there is the
step of sending fragments of the packet to fabrics of the switch.
Next there is the step of receiving the fragments of the packet at
the fabrics of the switch. Then there is the step of measuring the
length of the packet at each fabric from the fragments of the
packet received at each fabric. Next there is the step of
appending a length indicator to the packet. Then there is the step
of storing the packet with the length indicator in a memory
mechanism of the fabric. Next there is the step of reading the
packet from the memory mechanism. Then there is the step of
determining where the packet ends from the length indicator of the
packet.

BRIEF DESCRIPTION OF THE DRAWINGS 

In the accompanying drawings, the preferred embodiment of
the invention and preferred methods of practicing the invention are
illustrated in which:

Figure 1 is a schematic representation of packet striping
in the switch of the present invention.

Figure 2 is a schematic representation of an OC 48 port
card.

Figure 3 is a schematic representation of a concatenated
network blade.

Figure 4 is a schematic representation regarding the
connectivity of the fabric ASICs.

Figure 5 is a schematic representation of a 32 bit cell
transfer.

Figure 6 is a schematic representation regarding
back pressure.

Figure 7 is a schematic representation of a 32-bit packet
transferred using external connection number bus.

Figure 8 is a schematic representation of a 64 bit cell
transfer.

Figure 9 is a schematic representation of a 64-bit packet
transfer.

Figure 10 is a schematic representation of ATM cell flow
in the switch.

Figure 11 is a schematic representation of sync
pulse distribution.

Figure 12 is a schematic representation regarding the
write cycle.

Figure 13 is a schematic representation of the read
cycle.

Figure 14 is a schematic representation of the striper
ASIC architecture.

Figure 15 is a schematic representation of the aggregator
ASIC architecture.

Figure 16 is a schematic representation of a memory
controller ASIC architecture.

Figure 17 is a schematic representation of the wide cache
line shared memory architecture.

Figure 18 is a schematic representation of a separator
ASIC architecture.

Figure 19 is a schematic representation of an unstriper
ASIC architecture.

Figure 20 is a schematic representation regarding the
relationship between transmit and receive sequence counters for the
separator and unstriper, respectively.

Figure 21 is a schematic representation of a receive
synchronizer.

Figure 22 is a schematic representation of a switch
of the present invention.

Figure 23 is a schematic representation of a packet
with a length indicator.

DETAILED DESCRIPTION 

Referring now to the drawings wherein like reference
numerals refer to similar or identical parts throughout the several
views, and more specifically to figure 22 thereof, there is
shown a switch 10 for switching packets 11. Each packet 11 has a
length. The switch 10 comprises a port card 12 which receives
packets 11 from and sends packets 11 to a network 16. The switch
10 comprises fabrics 14 connected to the port card 12 which switch
the packets 11. Each fabric 14 has a memory mechanism 18. Each
fabric 14 has a mechanism for determining the length of each packet
11 received by the fabric 14 and placing a length indicator 22 with
the packet 11 so when the packet 11 is stored in the memory
mechanism 18, the determining mechanism 20 can identify from the
length indicator 22 how long the packet 11 is and where the packet
11 ends in the memory mechanism 18.

Preferably, the determining mechanism 20 includes an
aggregator 24 which receives packet 11 fragments 26 from a striper
40 of the port card 12, determines the packet 11 length and appends
packet 11 length information 28 to the beginning of the packet 11
in the length indicator 22, as shown in figure 23. Referring
again to figure 1, the memory mechanism 18 preferably includes a
memory controller 30. The aggregator 24 sends the packet 11 with
the packet 11 length information 28 to the memory controller 30
which stores the packet 11 with the packet 11 length information
28. Preferably, the memory controller 30 has a memory 32 which has
a wide cache buffer structure in which multiple packets 11 are put
into one word 34. The striper 40 preferably sends corresponding
fragments 26 of a packet 11 to the aggregator 24 of each of the
fabrics 14 during the same logical time.

The fabric 14 preferably includes a separator 36 which
reads the packets 11 from the memory controller 30 and extracts the
packet 11 length information 28 from each packet 11 to determine
when each packet 11 ends, and sends fragments 26 of the packet 11
to the port card 12. Preferably, the separator 36 removes the
packet 11 length information 28 from each packet 11 before sending
any fragments 26 of each packet 11 to an unstriper 38 of the port
card 12.



The present invention pertains to a method for switching
packets 11 having a length. The method comprises the steps of
receiving a packet 11 at a port card 12 of a switch 10. Then there
is the step of sending fragments 26 of the packet 11 to fabrics 14
of the switch 10. Next there is the step of receiving the
fragments 26 of the packet 11 at the fabrics 14 of the switch 10.
Then there is the step of measuring the length of the packet 11 at
each fabric 14 from the fragments 26 of the packet 11 received at
each fabric 14. Next there is the step of appending a length
indicator 22 to the packet 11. Then there is the step of storing
the packet 11 with the length indicator 22 in a memory mechanism 18
of the fabric 14. Next there is the step of reading the packet 11
from the memory mechanism 18. Then there is the step of
determining where the packet 11 ends from the length indicator 22
of the packet 11.

Preferably, the receiving step includes the step of
receiving the fragment 26 at an aggregator 24 of the fabric 14.
The measuring step preferably includes the step of measuring the
length of the packet 11 with the aggregator 24. Preferably, the
appending step includes the step of appending the length
indicator 22 to the packet 11 with the aggregator 24. The storing
step preferably includes the step of storing the packet 11 with the
length indicator 22 in a memory controller 30 of the memory
mechanism 18.

Preferably, the reading step includes the step of reading
the packet 11 from the memory controller 30 with a separator 36 of
the fabric 14. The determining step preferably includes the step
of determining where a packet 11 ends from the length indicator 22
with the separator 36. Preferably, after the determining step
there is the step of removing the packet 11 length information 28
from the separator 36.

After the removing step, there is preferably the step of
sending fragments 26 of the packets 11 from the separator 36 to the
port card 12. Preferably, the sending fragments 26 step includes
the step of sending fragments 26 of the packet 11 to the port card
12 in a same logical time with corresponding fragments 26 from
other fabrics to the port card 12. The storing step preferably
includes the step of storing the fragments 26 of the packet 11 in
a memory 32 of the memory controller 30 which has a wide cache
buffer structure in which multiple packets 11 are put into one word
34.

Preferably, after the reading step, there is the step of
extracting the packet 11 length information 28 from the packet 11
with a separator 36. The receiving step preferably includes the
step of receiving the fragments 26 of the packet 11 from the
fabrics 14 with an unstriper 38 of the port card 12. Preferably,
the sending fragments 26 to the fabric 14 step includes the step of
sending with a striper 40 of the port card 12 to the aggregator 24
of each fabric 14 the fragments 26 of the packet 11. The step of
sending fragments 26 to the port card 12 preferably includes the
step of sending fragments 26 from the separator 36 to an unstriper
38 of the port card 12.

In the operation of the invention, in the BFS Memory
Controller 30, a wide cache buffer structure is used to better
utilize the limited amount of buffer access bandwidth to enqueue
and dequeue high amounts of traffic. With this structure, multiple
packets are packed together into a wide memory word 34, and only one
write and one read is done to the buffer for all the packets that
are part of that word 34. Since multiple packets are put into one
word 34, the information about where one packet 11 ends and another
starts has to be maintained, i.e. packet length has to be
maintained. This can be done in two ways.

In one approach, the packet 11 length can be maintained
in a link list independent of the packet 11 data. This would
require that another wide cache buffer structure be used, where
each word 34 can hold length information 28 for up to N packets,
where N is the maximum number of packets that could start in one
word 34 of the packet data buffer. With the wide range of packet
sizes that are supported by BFS (from a 40 byte packet to a 64K byte
packet), N has to be computed based on the smallest size packets
that can be put together into one word 34 of the wide data buffer.
But if, on average, the packets were larger than the smallest size
packet, then most of the memory 32 in the buffer used to store
length information 28 would be wasted. Also, since packet 11 length
and packet 11 data have to be sent at the same time from the Memory
Controllers to the Separators, the link list handling the length
buffer has to be synchronized (kept in lock-step) with the link list
handling the packet 11 data buffer. This approach also requires a
separate bus for transferring the length information 28
independently across multiple physical devices.

In the operation of the invention, by appending packet 11
length information 28 to the beginning of a packet 11 in the
Aggregators, and sending it as one stream through the Memory
Controllers and to the Separators, the Memory Controllers do not
have to handle packet 11 length and packet 11 data separately. This
simplifies the Memory Controller 30 design, as it does not have to
maintain separate data and length link lists for the same packet 11
stream. This also saves a lot of memory 32 because, with the wide
cache memory approach for data, the length memory would have to be
similarly implemented. With different packet 11 sizes, on average,
a lot of this length memory could be wasted.

In a preferred approach, packet 11 length information 28
is queued and transmitted along with the packet 11 data. In this
case, packet 11 length information 28 (which is always a fixed
number of bits) is attached to the beginning of the data of each
packet 11. This entity, containing both packet 11 length and data,
is then queued as one. Even in this approach, multiple packets are
packed together in each word 34 of the wide cache buffer. When
packets are read from this buffer and sent to the Separator ASIC, it
extracts the length field for the first packet 11 and, based on
that, decides where the first packet 11 ends in the data stream, as
the length field gives the number of bits of packet 11 length and
data together. The bit after the end of the first packet 11 data
is the start bit of the second packet 11 length field. Then, again
based on this length value, the end of the second packet 11 data and
the start of the third packet 11 length are found. This way, all
packets can be extracted from the combined stream.
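
The extraction described above can be illustrated with a minimal
sketch. It is not the ASIC logic. It assumes a 20-bit length field
(the text says only that the field is a fixed number of bits), models
the stream as a string of '0' and '1' characters, and uses the
hypothetical names append_length and extract_packets.

```python
LENGTH_FIELD_BITS = 20  # assumed width; the text only says the field is a fixed number of bits


def append_length(data_bits):
    """Aggregator side: prefix a packet (a string of '0'/'1') with its length field.

    The stored value counts the length field and the data together, in bits.
    """
    total = LENGTH_FIELD_BITS + len(data_bits)
    if total >= 1 << LENGTH_FIELD_BITS:
        raise ValueError("packet too long for the assumed length field")
    return format(total, "0{}b".format(LENGTH_FIELD_BITS)) + data_bits


def extract_packets(stream):
    """Separator side: walk the combined stream and return the packet data, in order."""
    packets = []
    pos = 0
    while pos < len(stream):
        field = stream[pos:pos + LENGTH_FIELD_BITS]
        if len(field) < LENGTH_FIELD_BITS:
            raise ValueError("truncated length field")
        total = int(field, 2)
        if total < LENGTH_FIELD_BITS or pos + total > len(stream):
            raise ValueError("length field does not fit the stream")
        packets.append(stream[pos + LENGTH_FIELD_BITS:pos + total])
        pos += total  # the next bit starts the next packet's length field
    return packets
```

Because each length value covers its own field plus the data, the
reader needs no side channel: advancing by the length value always
lands on the next length field, however the packets fall across
memory words.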

The switch uses RAID techniques to increase overall
switch bandwidth while minimizing individual fabric bandwidth. In
the switch architecture, all data is distributed evenly across all
fabrics so the switch adds bandwidth by adding fabrics and the
fabric need not increase its bandwidth capacity as the switch
increases bandwidth capacity.

Each fabric provides 40G of switching bandwidth and the
system supports 1, 2, 3, 4, 6, or 12 fabrics, exclusive of the
redundant/spare fabric. In other words, the switch can be a 40G,
80G, 120G, 160G, 240G, or 480G switch depending on how many fabrics
are installed.

A portcard provides 10G of port bandwidth. For every 4
portcards, there needs to be 1 fabric. The switch architecture 
does not support arbitrary installations of portcards and fabrics. 
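
The capacity arithmetic above can be sketched as follows. The
function names are hypothetical, the 48-portcard limit is taken from
the portcard numbering used later in this description, and the sketch
ignores which slots the portcards occupy.

```python
import math

SUPPORTED_FABRIC_COUNTS = (1, 2, 3, 4, 6, 12)  # exclusive of the spare fabric
GBPS_PER_FABRIC = 40
PORTCARDS_PER_FABRIC = 4  # each portcard provides 10G


def switch_bandwidth_gbps(fabrics):
    """Switching bandwidth for a supported number of non-spare fabrics."""
    if fabrics not in SUPPORTED_FABRIC_COUNTS:
        raise ValueError("unsupported number of fabrics")
    return fabrics * GBPS_PER_FABRIC


def fabrics_required(portcards):
    """Smallest supported fabric count giving one fabric per 4 portcards."""
    if not 1 <= portcards <= 48:
        raise ValueError("portcard count out of range")
    needed = math.ceil(portcards / PORTCARDS_PER_FABRIC)
    return min(n for n in SUPPORTED_FABRIC_COUNTS if n >= needed)
```

For instance, 17 portcards would need 5 fabrics by the 4-to-1 ratio,
but 5 is not a supported configuration, so the sketch rounds up to 6.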



The fabric ASICs support both cells and packets. As a
whole, the switch takes a "receiver make right" approach where the
egress path on ATM blades must segment frames to cells and the
egress path on frame blades must perform reassembly of cells into
packets.

There are currently eight switch ASICs that are used in
the switch:

1. Striper - The Striper resides on the portcard and
SCP-IM. It formats the data into a 12 bit data
stream, appends a checkword, splits the data stream
across the N, non-spare fabrics in the system,
generates a parity stripe of width equal to the
stripes going to the other fabrics, and sends the
N+1 data streams out to the backplane.

2. Unstriper - The Unstriper is the other portcard
ASIC in the switch architecture. It receives
data stripes from all the fabrics in the system. It
then reconstructs the original data stream using
the checkword and parity stripe to perform error
detection and correction.

3. Aggregator - The Aggregator takes the data streams
and routewords from the Stripers and multiplexes
them into a single input stream to the Memory
Controller.

4. Memory Controller - The Memory Controller
implements the queueing and dequeueing mechanisms
of the switch. This includes the proprietary wide
memory interface to achieve the simultaneous
en-/de-queueing of multiple cells of data per clock
cycle. The dequeueing side of the Memory Controller
runs at 80Gbps compared to 40Gbps in order to make
the bulk of the queueing and shaping of connections
occur on the portcards.

5. Separator - The Separator implements the inverse
operation of the Aggregator. The data stream from
the Memory Controller is demultiplexed into
multiple streams of data and forwarded to the
appropriate Unstriper ASIC. Included in the
interface to the Unstriper is a queue and flow
control handshaking.

6. Trident - Trident is, strictly speaking, not one of
the ASICs. It is actually one-half of the Poseidon
chipset. Trident will be used to implement the ATM
portcards within the switch.

7. Vortex - Vortex is the partner to Trident in the
Poseidon chipset. Vortex is the ingress ASIC and
Trident the egress device. Together, the two chips
implement a 2.5Gbps ingress, 5Gbps egress system
capable of supporting up to OC48c ports.

8. Reassembler - The Reassembler ASIC is the frame
blade equivalent to Trident. It will be capable of
taking cell streams from the Unstriper and
converting them into frames.

There are 3 different views one can take of the
connections between the fabric: physical, logical, and "active."
Physically, the connections between the portcards and the fabrics
are all gigabit speed differential pair serial links. This is
strictly an implementation issue to reduce the number of signals
going over the backplane. The "active" perspective looks at a
single switch configuration, or it may be thought of as a snapshot
of how data is being processed at a given moment. The interface
between the fabric ASIC on the portcards and the fabrics is
effectively 12 bits wide. Those 12 bits are evenly distributed
("striped") across 1, 2, 3, 4, 6, or 12 fabrics based on how the
fabric ASICs are configured. The "active" perspective refers to the
number of bits being processed by each fabric in the current
configuration, which is exactly 12 divided by the number of fabrics.

The logical perspective can be viewed as the union or max
function of all the possible active configurations. Fabric slot #1
can, depending on configuration, be processing 12, 6, 4, 3, 2, or
1 bits of the data from a single Striper and is therefore drawn
with a 12 bit bus. In contrast, fabric slot #3 can only be used to
process 4, 3, 2, or 1 bits from a single Striper and is therefore
drawn with a 4 bit bus.
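
The active and logical widths can be worked out with a short sketch.
It assumes that fabric slot k is in use only in configurations with
at least k fabrics, which follows from the slots being populated in
ascending order. The function names are hypothetical.

```python
SUPPORTED_FABRIC_COUNTS = (1, 2, 3, 4, 6, 12)
INTERFACE_BITS = 12  # effective width of the portcard-to-fabric interface


def active_bits_per_fabric(fabrics):
    """Active view: bits handled by each fabric in one configuration."""
    if fabrics not in SUPPORTED_FABRIC_COUNTS:
        raise ValueError("unsupported number of fabrics")
    return INTERFACE_BITS // fabrics


def logical_bus_width(slot):
    """Logical view: the widest stripe a given fabric slot can ever carry."""
    if not 1 <= slot <= 12:
        raise ValueError("fabric slot out of range")
    # Assumption: slot k is used only when at least k fabrics are configured.
    return max(active_bits_per_fabric(n)
               for n in SUPPORTED_FABRIC_COUNTS if n >= slot)
```

Under that assumption the sketch gives a 12 bit bus for slot #1 and a
4 bit bus for slot #3, matching the two cases named in the text.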

Unlike previous switches, the switch really doesn't have
a concept of a software controllable fabric redundancy mode. The
fabric ASICs implement N+1 redundancy without any intervention as
long as the spare fabric is installed.

As for what it provides, N+1 redundancy means that
the hardware will automatically detect and correct a single failure
without the loss of any data.

The way the redundancy works is fairly simple, but to
make it even simpler to understand, a specific case of a 120G switch
is used which has 3 fabrics (A, B, and C) plus a spare (S). The
Striper takes the 12 bit bus and first generates a checkword which
gets appended to the data unit (cell or frame). The data unit and
checkword are then split into a 4-bit-per-clock-cycle data stripe
for each of the A, B, and C fabrics (A3A2A1A0, B3B2B1B0, and
C3C2C1C0). These stripes are then used to produce the stripe for the
spare fabric S3S2S1S0 where Sn = An XOR Bn XOR Cn and these 4
stripes are sent to their corresponding fabrics. On the other side
of the fabrics, the Unstriper receives four 4-bit stripes from A, B,
C, and S. All possible combinations of 3 fabrics (ABC, ABS, ASC, and
SBC) are then used to reconstruct a "tentative" 12-bit data stream.
A checkword is then calculated for each of the 4 tentative streams
and the calculated checkword compared to the checkword at the end
of the data unit. If no error occurred in transit, then all 4
streams will have checkword matches and the ABC stream will be
forwarded to the Unstriper output. If a (single) error occurred,
only one checkword match will exist and the stream with the match
will be forwarded off chip and the Unstriper will identify the
faulty fabric stripe.
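
The 3+1 case above can be sketched in a few lines. This is an
illustration under stated assumptions, not the ASIC implementation:
the checkword algorithm is not given in the text, so a 12-bit
additive checksum stands in for it; the assignment of the high,
middle and low 4 bits to fabrics A, B and C is assumed; and the
function names are hypothetical.

```python
def checkword(words):
    """Stand-in 12-bit checkword (additive checksum); the real algorithm is not specified."""
    return sum(words) & 0xFFF


def stripe(words):
    """Striper: append the checkword, split each 12-bit word into 4-bit stripes, add parity."""
    unit = list(words) + [checkword(words)]
    a = [(w >> 8) & 0xF for w in unit]  # assumed: high 4 bits go to fabric A
    b = [(w >> 4) & 0xF for w in unit]  # assumed: middle 4 bits go to fabric B
    c = [w & 0xF for w in unit]         # assumed: low 4 bits go to fabric C
    s = [x ^ y ^ z for x, y, z in zip(a, b, c)]  # Sn = An XOR Bn XOR Cn
    return {"A": a, "B": b, "C": c, "S": s}


def _join(a, b, c):
    return [(x << 8) | (y << 4) | z for x, y, z in zip(a, b, c)]


def _xor(p, q, r):
    return [x ^ y ^ z for x, y, z in zip(p, q, r)]


def unstripe(stripes):
    """Unstriper: build the 4 tentative streams and return (data, faulty_fabric_or_None)."""
    a, b, c, s = stripes["A"], stripes["B"], stripes["C"], stripes["S"]
    # Each tentative stream is keyed by the one fabric it does NOT rely on.
    tentative = {
        "S": _join(a, b, c),               # ABC
        "C": _join(a, b, _xor(a, b, s)),   # ABS: C rebuilt from parity
        "B": _join(a, _xor(a, s, c), c),   # ASC: B rebuilt from parity
        "A": _join(_xor(s, b, c), b, c),   # SBC: A rebuilt from parity
    }
    good = {unused: unit[:-1] for unused, unit in tentative.items()
            if checkword(unit[:-1]) == unit[-1]}
    if len(good) == 4:
        return good["S"], None             # no error: forward the ABC stream
    if len(good) == 1:
        (faulty, data), = good.items()     # the only match avoids the faulty fabric
        return data, faulty
    raise ValueError("uncorrectable: more than one stripe is damaged")
```

The single matching stream is the one that leaves out the damaged
stripe, which is how the sketch names the faulty fabric.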

For different switch configurations, i.e. 1, 2, 4, 6, or
12 fabrics, the algorithm is the same but the stripe width changes. 

If 2 fabrics fail, all data running through the switch 
will almost certainly be corrupted. 

There are basically two options, both requiring that the
defective fabrics be known through some means. Unfortunately, in
a double failure system, the hardware that detects and identifies
a failed fabric will only be able to identify the fabric that
failed first (if there was one). Identifying both the failed
fabrics may only be possible through a trial-and-error approach
unless the switch software and/or switch diagnostics can develop
tests to identify the second failure.

The recommended approach would be to shut down the switch
and install as many good fabrics as possible beginning with slot 1.
This allows the maximum bandwidth and redundancy to be available
given the functional hardware available.

The other option is to have the switch software
reconfigure the switch to use fewer fabrics. This is an inferior
solution for two reasons:

1. It can never provide more bandwidth than the
recommended approach.

2. It requires substantial thought and understanding
of the switch by the user in order to determine
what is the maximum operational configuration.

Basically, the user must start at fabric slot 1 and count
the number of operational fabrics. If the spare fabric is
operational, then it may be used to "cover" for the first
non-operational fabric.

Example #1: A redundant 240G switch (6+1 fabrics) has suffered
fabric failures in slots 3 and 4. Starting with slot 1 there are 2
operational fabrics and the spare is available to cover for slot 3.
This switch can be reconfigured to a 120G non-redundant switch or
an 80G redundant switch. Note that by swapping fabrics 5 and 6 into
slots 3 and 4, this switch could be a 160G redundant switch.

Example #2: A redundant 480G switch suffers fabric failures in
slot 1 and the spare. Start swapping fabrics. Slot 1 is dead and
the spare is not available to cover for it. This is the worst case
scenario.

Example #3: A redundant 480G switch suffers fabric failures in
slots 2 and 10. There is one functional fabric counting from slot
1, or 8 if the spare is used to cover for slot 2. This switch can be
configured either as 40G redundant or 240G non-redundant. Note that
fabrics 7, 8, and 9 do not help since the only legal configuration
after 6 fabrics is all 12.
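
The counting rule in the examples above can be sketched as follows.
This is one reading of that rule and not a description of the switch
software: it counts operational slots upward from slot 1, optionally
lets the spare cover the first failed slot, and rounds down to a
supported fabric count. The function names are hypothetical.

```python
SUPPORTED_FABRIC_COUNTS = (1, 2, 3, 4, 6, 12)
GBPS_PER_FABRIC = 40


def _usable_run(total_slots, failed_slots, spare_covers):
    """Count usable slots from slot 1 up to the first failure that cannot be covered."""
    count = 0
    spare_used = False
    for slot in range(1, total_slots + 1):
        if slot in failed_slots:
            if spare_covers and not spare_used:
                spare_used = True  # the spare stands in for this slot
            else:
                break
        count += 1
    return count


def _largest_legal(count):
    """Largest supported fabric count not exceeding count (0 if none)."""
    return max((n for n in SUPPORTED_FABRIC_COUNTS if n <= count), default=0)


def reconfiguration_options(total_slots, failed_slots, spare_ok):
    """Return (redundant_gbps, non_redundant_gbps); 0 means no such configuration."""
    plain = _largest_legal(_usable_run(total_slots, failed_slots, False))
    redundant = plain * GBPS_PER_FABRIC if spare_ok else 0
    covered = _largest_legal(_usable_run(total_slots, failed_slots, spare_ok))
    return redundant, covered * GBPS_PER_FABRIC
```

Run against the three examples, the sketch gives 80G redundant or
120G non-redundant for the first, nothing for the second, and 40G
redundant or 240G non-redundant for the third.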

The fabric slots are numbered and must be populated in
ascending order. Also, the spare fabric is a specific slot so
populating fabric slots 1, 2, 3, and 4 is different than populating
fabric slots 1, 2, 3, and the spare. The former is a 160G switch
without redundancy and the latter is 120G with redundancy.




Firstly, the ASICs are constructed and the backplane
connected such that the use of certain portcard slots requires
there to be at least a certain minimum number of fabrics installed,
not including the spare. This relationship is shown in Table 0.

In addition, the APS redundancy within the switch is
limited to specifically paired portcards. Portcards 1 and 2 are
paired, 3 and 4 are paired, and so on through portcards 47 and 48.
This means that if APS redundancy is required, the paired slots
must be populated together.

To give a simple example, take a configuration with 2
portcards and only 1 fabric. If the user does not want to use APS
redundancy, then the 2 portcards can be installed in any two of
portcard slots 1 through 4. If APS redundancy is desired, then the
two portcards must be installed either in slots 1 and 2 or slots 3
and 4.



Portcard Slot    Minimum # of Fabrics
1-4              1
5-8              2
9-12             3
13-16            4
17-24            6
25-48            12

Table 0: Fabric Requirements for Portcard Slot Usage
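
Table 0 and the APS pairing rule can be expressed as a small lookup.
This is a minimal sketch and the function names are hypothetical.

```python
# (portcard slot range, minimum number of non-spare fabrics), from Table 0
MIN_FABRICS_BY_SLOT = (
    (range(1, 5), 1),
    (range(5, 9), 2),
    (range(9, 13), 3),
    (range(13, 17), 4),
    (range(17, 25), 6),
    (range(25, 49), 12),
)


def min_fabrics_for_slot(slot):
    """Minimum fabrics required before this portcard slot can be used."""
    for slots, fabrics in MIN_FABRICS_BY_SLOT:
        if slot in slots:
            return fabrics
    raise ValueError("portcard slot out of range")


def min_fabrics_for_slots(slots):
    """Minimum fabrics required for a set of populated portcard slots."""
    return max(min_fabrics_for_slot(s) for s in slots)


def aps_partner(slot):
    """APS pairs are (1, 2), (3, 4), ... (47, 48)."""
    min_fabrics_for_slot(slot)  # validates the slot number
    return slot + 1 if slot % 2 else slot - 1
```

For example, a single portcard in slot 12 needs 3 fabrics, and a
portcard in slot 3 is APS-paired with slot 4.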

To add capacity, add the new fabric(s), wait for the
switch to recognize the change and reconfigure the system to stripe
across the new number of fabrics. Install the new portcards.

Note that it is not technically necessary to have the
full 4 portcards per fabric. The switch will work properly with 3
fabrics installed and a single portcard in slot 12. This isn't cost
efficient but it will work.

To remove capacity, reverse the adding capacity
procedure.

The switch is oversubscribed if, for example, 8 portcards
are installed with only one fabric.

This should only come about as the result of improperly
upgrading the switch or a system failure of some sort. In
practice, one of two things will occur, depending on how this
situation arises. If the switch is configured as a 40G switch and
the portcards are added before the fabric, then the 5th through 8th
portcards will be dead. If the switch is configured as an 80G
non-redundant switch and the second fabric fails or is removed, then
all data through the switch will be corrupted (assuming the spare
fabric is not installed). And just to be complete, if 8 portcards
were installed in an 80G redundant switch and the second fabric
failed or was removed, then the switch would continue to operate
normally with the spare covering for the failed/removed fabric.

The switch includes the following features: 

Scales from 40Gbps to 480Gbps (40, 80, 120, 160, 240, 480
Gb/sec are the supported configurations).

Switches ATM cells and variable-length packets.

N+1 fabric redundancy with error detection and recovery
supported in the ASIC chipset.

Native APS support.

Support up to 19GK cell shared memory , — 921GK unicast and 

5 G4K multicast connections. 

Support 2X port speed for fabric dequeueing (2.5 Gb/sec
in, 5 Gb/sec out for each OC48 port).

Supports both OC48c ports and OC192c ports.

Provides port/priority queuing similar to past switch
fabrics. Four priorities are provided for 40-120 Gb/sec
switches, 2 priorities/port for 240 Gb/sec switches and
1 priority for 480 Gb/sec switches.

ASICs utilize 250 MHz HSTL point to point busses between
fabric ASICs and interface with the backplane using
standard Gbit transceivers.

Interface — to — ptrrt — cards — chips — tree — 00 125 — MUr — LVTTL 

signals . 

Support output port supplied back pressure.



The significant architectural difference between the switch and
past switches is that incoming traffic is routed to multiple switch
fabrics. Each fabric is designed to enqueue 40 Gb/sec of data and
dequeue 80 Gb/sec of data. As data comes into the switch, it is
broken up on a bit by bit basis and part of each packet is sent to
each fabric in the box. The fabrics will all make the same enqueuing
and drop decisions, and all schedule fragments of a packet/cell at
the same time. Each fabric sends its portion of the packet or cell
to the output port card, which reassembles the fragments into the
complete cell/packet, which is then passed to a shared memory ASIC
for per port storage and scheduling. The XOR of the data sent to
each fabric is sent to a spare fabric. In the event of a fabric
failure, that fabric's data can be recovered by utilizing the good
data bits and the parity fabric bits to recalculate any fabric's
data. The striping of data to fabrics happens on the basis of 48 bit
chunks. This allows the switch to support 1, 2, 3, 4, 6 and 12
fabrics.
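The parity scheme described above can be illustrated with a minimal sketch. It is not the chipset's implementation: the function names are hypothetical, and for readability the data is striped byte by byte rather than bit by bit within 48 bit chunks. It only shows that the XOR of all shares lets any single lost share be recalculated from the remaining shares and the parity share.

```python
from functools import reduce


def stripe(data: bytes, n_fabrics: int) -> list[bytes]:
    """Split data round-robin into one share per fabric (byte granularity
    is an assumption made for this sketch)."""
    if len(data) % n_fabrics:
        raise ValueError("pad data to a multiple of n_fabrics first")
    return [data[i::n_fabrics] for i in range(n_fabrics)]


def parity(shares: list[bytes]) -> bytes:
    """XOR of all shares, i.e. what would be sent to the spare fabric."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*shares))


def recover(shares: list[bytes], parity_share: bytes, failed: int) -> bytes:
    """Recalculate the share of one failed fabric from the good shares
    and the parity share."""
    good = [s for i, s in enumerate(shares) if i != failed]
    return parity(good + [parity_share])


def unstripe(shares: list[bytes]) -> bytes:
    """Reassemble the original data from the per-fabric shares."""
    out = bytearray(len(shares) * len(shares[0]))
    for i, share in enumerate(shares):
        out[i::len(shares)] = share
    return bytes(out)
```

For example, with four fabrics, `recover(shares, parity(shares), 2)` returns the same bytes as `shares[2]`, so the egress side can rebuild the complete cell or packet even if the third fabric fails.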

Five ASICs build the switching functionality for the switch. These
ASICs are described briefly below.



TABLE 1: The switch ASICs

ASIC: Function

Striper: Takes incoming cell from Vortex (or OC192c equivalent) or
from POS input stage and breaks the data up into the appropriate
chunks to go to each fabric, calculates the parity for the spare
fabric, concatenates a checksum onto the packet, and separates the
routeword and data into separate routeword and data busses which run
across the backplane.

Aggregator: Receives separate data and routeword busses from
multiple stripers. Converts from the reasonably slim dedicated
striper-to-aggregator busses to a wide shared bus to the memory
controllers.

Memory Controllers: Actually perform the queueing of data for the
fabrics. Queues the cell into one of 200 queues (192 UC queues,
4 MC queues and 4 control port queues). All drops which occur in the
chipset occur here.

Separator: Combines traffic from multiple memory controllers to one
fabric output. Provides rate control of the stream of data leaving
the fabric for each OC48 or OC192c port.

Unstriper: [...] any fabric and attempts to reconstruct the good
data. Passes the data to the output memory controller. If the
striper is on an ATM blade and the data is a packet, it is segmented
before passing onto the ATM controller.



Figure 1 shows packet striping in the switch.

The chipset supports ATM and POS port cards in both OC48
and OC192c configurations. OC48 port cards interface to the
switching fabrics with four separate OC48 flows. OC192 port cards
logically combine the 4 channels into a 10G stream. The ingress
side of a port card does not perform traffic conversions for
traffic changing between ATM cells and packets. Whichever form of
traffic is received is sent to the switch fabrics. The switch
fabrics will mix packets and cells and then dequeue a mix of
packets and cells to the egress side of a port card.

The egress side of the port is responsible for converting
the traffic to the appropriate format for the output port. This
convention is referred to in the context of the switch as "receiver
makes right". A cell blade is responsible for segmentation of
packets and a packet blade is responsible for reassembly of cells
into packets. To support fabric speed-up, the egress side of the
port card supports a link bandwidth equal to twice the inbound side
of the port card. For each OC48 interface, the unstriper supports
a bandwidth of 6 Gb/sec and for each OC192 interface, a bandwidth of
24 Gb/sec (combined routeword and data).

The block diagram for a Poseidon-based ATM port card is
shown in Figure 2. Each 2.5G channel consists of 4 ASICs: Vortex
Inbound TM and striper ASIC at the inbound side and unstriper ASIC
and Trident outbound TM ASIC at the outbound side.

At the inbound side, the Vortex ASIC aggregates 1 OC-48c
or 4 OC-12c interfaces. Each Vortex sends a 2.5G
cell stream into a dedicated striper ASIC (using the BIB bus, as
described below). The striper converts the Vortex-supplied
routeword into two pieces. A portion of the routeword is passed to
the fabric to determine the output port(s) for the cell. The
entire routeword is also passed on the data portion of the bus as
a routeword for use by the outbound memory controller. The first
routeword is termed the "fabric routeword". The routeword for the
outbound memory controller is the "egress routeword".

At the outbound side, the unstriper ASIC in each channel
takes traffic from each of the port cards, error checks and corrects
the data and then sends correct packets out on its output bus. The
unstriper uses the data from the spare fabric and the checksum
inserted by the striper to detect and correct data corruption. The
5Gbps traffic is then sent to the Trident ASIC of the Poseidon
chipset. The Trident ASIC stores the incoming cells based on per-VC
queues and sends them out to OC-12c/OC-48c interfaces at an
aggregated speed of 2.5Gbps.

For the POS interfaces, the striper ASIC input bus speeds
up to 3.2Gbps to handle POS overhead. On the outbound side, the
unstriper talks to a reassembly stage which is currently being
defined.

Figure 2 shows an OC48 Port Card.

The OC192 port card supports a single 10G stream to the
fabric and between a 10G and 20G egress stream. This board also
uses 4 stripers and 4 unstripers, but the 4 chips operate in
parallel on a wider data bus. The data sent to each fabric is
identical for both OC48 and OC192 ports so data can flow between
the port types without needing special conversion functions.

Figure 3 shows a 10G concatenated network blade.

Each 40G switch fabric enqueues up to 40Gbps of cells/frames
and dequeues them at 80Gbps. This 2X speed-up reduces the amount of
traffic buffered at the fabric and lets the outbound ASIC digest
bursts of traffic well above line rate. A switch fabric consists of
three kinds of ASICs: aggregators, memory controllers, and
separators. Nine aggregator ASICs receive 40Gbps of traffic from up
to 48 network blades and the control port. The aggregator ASICs
combine the fabric routeword and payload into a single data stream,
time-division multiplex (TDM) between their sources, and place the
resulting data on a wide output bus. An additional control bus
(destid) is used to control how the memory controllers enqueue the
data. The data stream from each aggregator ASIC is then bit sliced
into 12 memory controllers.

The memory controller receives up to 16 cells/frames
every 250MHz clock cycle. Each of 12 ASICs stores 1/12 of the
aggregated data streams. It then stores the incoming data based on
control information received on the destid bus. Storage of data is
simplified in the memory controller to be relatively unaware of
packet boundaries (cache line concept). All 12 ASICs dequeue the
stored cells simultaneously at an aggregated speed of 80Gbps.

Nine separator ASICs perform the reverse function of the
aggregator ASICs. Each separator receives data from all 12 memory
controllers and decodes the routewords embedded in the data streams
by the aggregator to find packet boundaries. Each separator ASIC
then sends the data to up to 24 different unstripers depending on
the exact destination indicated by the memory controller as data
was being passed to the separator.

The dequeue process is back-pressure driven. If
back-pressure is applied to the unstriper, that back-pressure is
communicated back to the separator. The separator and memory
controllers also have a back-pressure mechanism which controls when
a memory controller can dequeue traffic to an output port.

In order to support OC48 and OC192 efficiently in the
chipset, the 4 OC48 ports from one port card are always routed to
the same aggregator and from the same separator (the port
connections for the aggregator and separator are always symmetric).
The table below shows the port connections for the aggregator and
separator on each fabric for the switch configurations. Since each
aggregator is accepting traffic from 10G of ports, the addition of
40G of switch capacity only adds ports to 4 aggregators. This leads
to a differing port connection pattern for the first four
aggregators from the second 4 (and also the corresponding
separators).



TABLE 2: Agg/Sep port connections

Switch Size  Agg 1         Agg 2         Agg 3         Agg 4         Agg 5         Agg 6         Agg 7         Agg 8
40           1,2,3,4       5,6,7,8       9,10,11,12    13,14,15,16
80           1,2,3,4       5,6,7,8       9,10,11,12    13,14,15,16   17,18,19,20   21,22,23,24   25,26,27,28   29,30,31,32
120          1,2,3,4,      5,6,7,8,      9,10,11,12,   13,14,15,16,  17,18,19,20   21,22,23,24   25,26,27,28   29,30,31,32
             33,34,35,36   37,38,39,40   41,42,43,44   45,46,47,48
160          1,2,3,4,      5,6,7,8,      9,10,11,12,   13,14,15,16,  17,18,19,20,  21,22,23,24,  25,26,27,28,  29,30,31,32,
             33,34,35,36   37,38,39,40   41,42,43,44   45,46,47,48   49,50,51,52   53,54,55,56   57,58,59,60   61,62,63,64



Figure 4 shows the connectivity of the fabric ASICs. 

The external interfaces of the switches are the Input Bus
(BIB) between the striper ASIC and the ingress blade ASIC such as
Vortex and the Output Bus (BOB) between the unstriper ASIC and the
egress blade ASIC such as Trident.

Two variations of routewords are supported. The first
option uses one 32 bit routeword which is passed to the egress
board as the egress routeword and has fields extracted to form the
fabric routeword. The second option allows the striper to accept
both a fabric routeword (which happens on a dedicated routeword
bus) and an egress routeword (which is received on the data bus).
The second option is more flexible on connection space usage and
expansion since that allows all 32 bits of the routeword to be used
to identify connections on switch egress.

To maintain compatibility with Vortex, bit 24 is still
maintained as the multicast bit. The incoming routeword has the
following format.



TABLE 3: 32-bit BIB/BOB routeword format

Bits        Contents
bit 30:25   Connection ID(29:28) & Connection ID(19:16)
bit 24      Multicast Bit
bit 23:0    Connection ID(27:20) & Connection ID(15:0)



The 26 bit conn ID in the routeword is set to

    MC bit & Connection ID (29:5) for UC connections which are
    not special routeword values

    MC bit & Connection ID (24:0) for MC connections or for
    special routeword unicast values.

For UC connections, although bits 29:5 are passed to the
fabric, only bits 29:20 are used. These bits should be programmed
with the queue to be used. Bits 29:28 should be programmed with the
priority and bits 27:20 programmed with the queue number.
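The field extraction described above can be sketched as follows. This is a hedged illustration only: the bit positions follow one reading of the routeword format table (connection ID bits 29:28 and 19:16 in routeword bits 30:25, the multicast bit in bit 24, and connection ID bits 27:20 and 15:0 in routeword bits 23:0), the function name and the `special` flag are hypothetical, and the caller is assumed to know whether a unicast value is a special routeword.

```python
def decode_routeword(rw: int, special: bool = False) -> dict:
    """Unpack a 32 bit routeword under the assumed bit layout."""
    mc = (rw >> 24) & 0x1
    connection_id = (
        (((rw >> 29) & 0x3) << 28)     # connection ID bits 29:28
        | (((rw >> 16) & 0xFF) << 20)  # connection ID bits 27:20
        | (((rw >> 25) & 0xF) << 16)   # connection ID bits 19:16
        | (rw & 0xFFFF)                # connection ID bits 15:0
    )
    if mc or special:
        # MC bit & Connection ID (24:0)
        fabric_conn_id = (mc << 25) | (connection_id & 0x1FFFFFF)
    else:
        # MC bit & Connection ID (29:5)
        fabric_conn_id = (mc << 25) | (connection_id >> 5)
    return {
        "multicast": mc,
        "connection_id": connection_id,
        "priority": (connection_id >> 28) & 0x3,  # bits 29:28
        "queue": (connection_id >> 20) & 0xFF,    # bits 27:20
        "fabric_conn_id": fabric_conn_id,         # 26 bit value
    }
```

With 2 priority bits and 8 queue bits, the fabric consumes 10 bits of the connection ID, which is consistent with 20 bits remaining for the outbound memory controller.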



Note that the RW value used for the outbound memory
controller is set to

    [...] & MC bit & connection ID (29:0).

If the fabric is using 10 bits of conn ID, this leaves 20
bits (1 M connections) for use by the outbound memory controller.



For double routewords, no manipulation is done. The
value passed in on the routeword bus needs to equal the connection
ID to be transmitted on the backplane. The following two tables
show the routeword value which should be passed on the backplane
routeword bus.



TABLE 4: Unicast Connection ID for separate RW bus

Bits        Contents
[...]       Multicast bit = 0
bit 24:23   Fabric priority
bit 22:15   Fabric queue ID
[...]       Future expansion bits. These bits are [...]







TABLE 5: Multicast Connection ID for separate RW bus

Bits        Contents
[...]       Multicast bit
bit 24:23   Priority queue ID
bit 22:16   Reserved. Note these bits are sent to the fabric to
            allow future fabrics to support more connection space.
bit 15:0    Multicast connection ID (0 to 64K) used by the fabric

Special routewords are flagged by using reserved queue
numbers (those in the range of 240 - 255). These routeword values
indicate the receipt of an OAM cell which must get routed to the
control port or a queue resynch operation. These special values
are always expressed in terms of the connection ID which goes to
the fabric. If special routewords are given to the fabric, the
memory controller routeword must also be modified if these are
getting passed in using the separate connection number bus.



The routeword passed to the fabric will contain the
multicast bit and the port mask bits (bits 23:16). The routeword
passed to the outbound memory controller will maintain the port
mask and also contain the Vortex ID and the port ID.

The connection ID of an OAM cell has a special format
generated by the Vortex ASIC:



TABLE 6: Connection ID for OAM cell

Bits        Contents
[...]       Multicast bit = 0
bit 24:23   Vortex ID (7:6)
bit 22:15   0xF0 (hex)
bit 14:9    Vortex ID (5:0)
bit 8       Reserved
bit 7:0     Port ID



The Vortex ID field is used to indicate which source
Vortex ASIC the cell comes from. The port ID indicates which port
the cell comes from inside the Vortex ASIC. Note that OAM cells are
all unicast. All OAM cells are destined to one of 196 blade and
control port queues programmed by an 8-bit OAM cell destination
register in the memory controller ASICs. If separate routeword
busses are being used, bits 24:16 of the BIB_CONN field will be
passed to the fabric. The routeword which appears on the data bus
(memory controller routeword) should include the port mask, Vortex
ID and port ID fields in bits 23:0. The value in the multicast bit
is a don't care for the memory controller routeword.

Fabric queue ID 0xF0-0xF7 of the unicast connection ID is
reserved for software use. All packets which have the fabric queue
ID in the range of 0xF0-0xFF will be redirected to one of the 4
control port queues based on a programmable register.

The connection ID of a resync cell has the following
format. The resync cell is used to resynchronize queues in the
memory controller ASICs. Fabric queue ID 0xF0 - 0xFF of the unicast
connection ID is reserved for special fabric functions.



TABLE 7: Connection ID for Resync cell

Bits        Contents
[...]       Multicast bit = 0
[...]       Priority (unused)
bit 22:15   0xF[...] (hex)
bit 14:13   Number of priorities per port
bit 12:0    Reserved



The number of priority queues per port can only be
changed during the queue resync period, i.e., when a fabric is
removed or inserted, as follows:

- 00: one priority per port for 480G switch, pick bit 15
  down to 8 of the connection ID as the queue ID;

- 01: two priorities per port for 240G switch, pick bit 16
  down to 9 of the connection ID as the queue ID;

- 10: 4 priorities per port for 120G or smaller switch,
  pick bit 17 down to 10 of the connection ID as the queue ID;

- 11: reserved.
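The selection of the queue ID by priority mode can be sketched as a simple shift and mask. This is an illustrative sketch, not the memory controller's logic: the names are hypothetical, and the bit range for mode 00 is partly illegible in the source and is assumed here to be bits 15:8, following the pattern of the other two modes (16:9 and 17:10).

```python
# Lowest connection ID bit of the 8 bit queue ID for each 2 bit mode code.
# 0b00 -> bits 15:8 (assumed), 0b01 -> bits 16:9, 0b10 -> bits 17:10.
QUEUE_ID_LSB = {0b00: 8, 0b01: 9, 0b10: 10}


def queue_id(connection_id: int, mode: int) -> int:
    """Return the 8 bit queue ID selected by the priorities-per-port mode."""
    if mode not in QUEUE_ID_LSB:
        raise ValueError("mode 0b11 is reserved")
    return (connection_id >> QUEUE_ID_LSB[mode]) & 0xFF
```

Each additional priority bit shifts the queue ID window up by one position, so the bits below the window are available to carry the priority.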

The resync cell can also be used to copy the shadow data
register to a valid location where the shadow address register
points to.

The shadow control cell is used to copy the shadow data
register to a valid location where the shadow address register
points to. The connection ID of a shadow control cell uses the
following format.



TABLE 8: Connection ID for Shadow Control Cell

Bits        Contents
[...]       Multicast bit = 0
[...]       Priority
bit 22:15   0xFC (hex)
bit 14:0    Reserved



Data coming into the BIB bus and out of the BOB bus is
assumed to be filled onto the busses from most significant bit to
least significant bit (highest number bit to lowest number bit).

The Striper ASIC accepts data from the ingress port via
the Input Bus (BIB) (also known as DIN_ST_bl_ch bus).

This bus can either operate as 4 separate 32 bit input
buses (4xOC48c) or a single 128 bit wide data bus with a common set
of control lines to all stripers. This bus supports either cells or
packets based on software configuration of the striper chip. It
consists of the following signals:

- BIB_Clock: This clock is sourced by the Striper ASIC at
  up to 100 MHz and is used as a reference for data and
  control signals on the BIB.



- BIB_BP: This signal is asserted (low) to indicate the
  striper ASIC cannot take data on the bus due to a
  bandwidth difference between the BIB and SIB busses.
  Interfaces which run below 93 MHz will never see this
  signal asserted. At 100 MHz, this signal is asserted if
  more than 65536 bytes of back to back data are given.
  This signal should be sampled at the start of packet.
  During a packet transfer, this signal will be asserted if
  the FIFO conditions would cause BP if the packet ended on
  the current clock cycle. If BP is asserted the clock
  cycle after the EOP, the striper will effectively ignore
  the input bus until the BP indication is withdrawn. The
  packet ingress stage should repeat the first word of the
  next packet transfer and then proceed with the rest of
  the packet after the BP signal goes away.



- BIB_Valid_L: This active low input signal delimits valid
  data on the BIB_SOP, BIB_EOP, and BIB_DATA busses. If
  this signal is active, the busses are assumed to be
  valid. If high, the busses are treated as having invalid
  data for the current clock cycle. If a transfer is not in
  progress (no SOP without EOP has been given) then the
  data bus is treated as invalid even if this signal is a
  one. For cell interfaces, this signal can be tied active.

- BIB_Cell_Pkt: This signal is set to a one to indicate a
  cell transfer and a zero to indicate a packet transfer.
  The signal needs to be valid the same clock cycle as the
  start of cell.

- BIB_Data[127:0]: This is the input 128 bit data bus. If
  running in 32 bit mode, a cell consists of a 4 byte RW,
  a 4 byte Header, and twelve 4 byte data words. A packet
  has a RW and N data words, where 1 ≤ N. If running in 128
  bit mode, a cell has a 4 byte RW, a 4 byte header, and 8
  bytes of data in the first word, 2 words with 16 bytes of
  data, and a final word with 8 bytes of data, if the data
  starts on a word boundary. A following cell can start on
  the half-word boundary and have all fields offset by 8
  bytes. Packets in 128 bit mode work in the same fashion
  as 32 bit mode, except that EOP and SOP can have larger
  values. Minimum packet length supported is 16 bytes. If
  half-word boundary cell starts are used, the correct
  value (0/4) needs to be given on the SOP bits 3:0.

- BIB_EOP[4:0]: This bus has two fields. Bit 4 is a one to
  indicate an EOP on the current transfer (if BIB_Valid_L
  is active). Bit 4 is a zero to indicate no EOP on the
  current transfer. Bits 3:0 give the offset of the last
  byte which is valid. The EOP field is not utilized for
  cell transfers.

- BIB_SOP/C[1:0]: This bit indicates a start of packet or
  cell on the current bus cycle (if BIB_Valid_L is active).
  A value of zero indicates start of transfer, a value of
  one indicates no start of transfer. Asserting bit 1
  indicates that the upper 64 bits carry the SOP and
  asserting bit 0 indicates that the lower 64 bits carry
  the SOP (for 128 bit bus only). For the 32 bit bus,
  SOP(0) should be used and SOP(1) should be tied high.
  For the 128 bit bus, if a packet ends in the upper 64
  bits of the bus, a new packet can begin at bit [...]

- BIB_CONN(24:0): This is an optional bus. It can be used
  to pass a routeword to the striper ASIC to use as the
  fabric routeword, or the routeword can be transferred as
  the most significant 32 bits of the first word of data.
  The data should be valid the same cycle as SOP/C. The
  value during non SOP/C cycles is a don't care. The
  interface is statically configured to either use the
  separate connection number bus or to expect the routeword
  on the data bus.
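The cell layouts described for the data bus can be checked with a short calculation. The sketch below is illustrative only (the function and constant names are hypothetical); it assumes a cell of a 4 byte RW, a 4 byte header and 48 bytes of data that starts on a word boundary, and lists how many data bytes each bus word carries.

```python
RW_BYTES = 4       # routeword
HEADER_BYTES = 4   # cell header
DATA_BYTES = 48    # twelve 4 byte data words


def cell_layout(bus_bytes: int) -> list[int]:
    """Data bytes carried in each bus word of one cell that starts on a
    word boundary, for a bus that is bus_bytes wide."""
    overhead = RW_BYTES + HEADER_BYTES
    data = DATA_BYTES
    layout = []
    while overhead or data:
        room = bus_bytes
        used = min(overhead, room)   # RW and header are sent first
        overhead -= used
        room -= used
        carried = min(data, room)    # remaining room carries cell data
        data -= carried
        layout.append(carried)
    return layout
```

For a 4 byte (32 bit) bus this gives two words without data followed by twelve 4 byte data words, and for a 16 byte (128 bit) bus it gives 8, 16, 16 and 8 data bytes. The final 128 bit word is only half used, which is why a following cell can start on the half-word boundary.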

Figure 5 shows a 32 bit BIB cell transfer.

Figure 6 shows a BIB back pressure.

Figure 7 shows a 32 bit BIB packet transfer using
external connection number bus.

The unstriper ASIC sends data to the egress port via the
Output Bus (BOB) (also known as DOUT_UN_bl_ch bus), which is a 64
(or 256) bit data bus that can support either cells or packets.
This bus can either operate as 4 separate output buses (4xOC48c) or
a single wide data bus with a common set of control lines from all
unstripers. It supports either cells or packets based on software
configuration of the unstriper chip. It consists of the following
signals:



- BOB_Clock: This clock is sourced from the unstriper ASIC
  at up to 100 MHz and is used as a reference for data and
  control signals on the BOB.

- BOB_BP: This active low input signal indicates whether
  data can be transferred (inactive) or cannot be
  transferred (active). When back pressure is asserted,
  the unstriper will stop advancing the output bus and
  signal that data is not valid using the BOB_Valid signal.
  Since synchronization must be done on both sides of the
  interfaces, 8 clock cycles of data must be allowed from
  the assertion of BP to data stopping. The source driving
  BOB_BP cannot make any assumptions on the data stopping
  or restarting except by examining BOB_Valid.

- BOB_Valid_L: This active low output signal indicates
  whether the bus has valid data or not during a transfer.
  This signal indicates invalid data only when BOB_BP has
  been asserted.



- BOB_Data: This is the output data bus. It can either
  be 64 bits wide or 256 bits wide. If running in 64 bit
  mode, a cell consists of a word with a 4 byte RW and a 4
  byte Header followed by 6 data words. A packet has a RW
  and N data words, where 1 ≤ N. If running in 256 bit mode
  and a cell starts on an even 32 byte word boundary, a
  cell has a word with a 4 byte RW, a 4 byte header and 24
  bytes of data in the first word, and a second word with
  24 bytes of data. A following cell can start on the next
  unused byte and have all fields offset by 8 bytes. Valid
  cell start locations are all multiples of 8 (0, 8, 16,
  24). Packets in 128 bit mode work in the same fashion as
  32 bit mode, except that EOP and SOP can have larger
  values. Minimum packet length supported is 16 bytes. If
  half-word boundary cell starts are used, the correct
  value (0/4) needs to be given on the SOP bits 3:0.

- BOB_EOP: This bit is asserted when the last transfer of
  a packet is occurring.

- BOB_Cell_Pkt: This signal is set to a one to indicate a
  cell transfer and a zero to indicate a packet transfer.
  The signal needs to be valid the same clock cycle as the
  start of cell.

- BOB_SOP/C: This bit is a zero to indicate a start of
  packet or cell on the current bus cycle. Data is always
  assumed to start at the most significant bit of the bus.

Figure 8 shows a 64 bit BOB cell transfer.

Figure 9 shows a 64 bit BOB packet transfer.

Figure 10 shows an overview of the datapath of the switch
ASICs.



The data on the data bus transports an optional byte
count (32 bit word, lower 16 bits are the byte count) and a 32 bit
egress routeword. The unstriper core will always produce a byte
count. If a segmentation engine is used to break the packet up
into cells, then the segmentation engine will drop the byte count
word before it is given to the cell interface. This dropping is
only supported in OC48 mode. In OC192 mode, the chipset will have
no provisions for segmentation and dropping the byte count word.



TABLE 9: OC48 BOB format

OC48 Bits   OC192 Bits   Field        Usage
63:48       255:240      Unused       Reserved for unstriper use.
47:32       239:224      Byte count   Gives the count of the number of bytes in
                                      the packet, not counting the 4 bytes for
                                      the egress routeword and the bytes for
                                      the byte count (basically, this
                                      corresponds to the byte count of the
                                      received packet plus/minus any changes
                                      for reencapsulation, pushes, or pops).
            223:192      Egress RW    Routeword for the egress memory
                                      controller.

The next bits start the data (bits 191 to 0 for OC192, the next
clock cycle for OC48).
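As an illustration of the format above, the first 64 bit word of an OC48 packet transfer could be unpacked as sketched below. The function name is hypothetical, and the position of the egress routeword in the OC48 word (bits 31:0) is an assumption made by analogy with the OC192 column, since the OC48 bit range for that field is not legible in the table.

```python
def parse_oc48_bob_header(word: int) -> tuple[int, int]:
    """Split the first 64 bit BOB word into (byte_count, egress_rw).

    Bits 63:48 are unused and ignored, bits 47:32 hold the byte count,
    and bits 31:0 are assumed to hold the egress routeword.
    """
    byte_count = (word >> 32) & 0xFFFF
    egress_rw = word & 0xFFFFFFFF
    return byte_count, egress_rw
```

The byte count returned here excludes the header word itself, so a consumer that needs the total transfer size would add the bytes of the byte count word and the egress routeword.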



The Synchronizer has two main purposes. The first
purpose is to maintain logical cell/packet or datagram ordering
across all fabrics. On the fabric ingress interface, datagrams
arriving at more than one fabric from one port card's channels
need to be processed in the same order across all fabrics. The
Synchronizer's second purpose is to have a port card's egress
channel re-assemble all segments or stripes of a datagram that
belong together even though the datagram segments are being sent
from more than one fabric and can arrive at the blade's egress
inputs at different times. This mechanism needs to be maintained in
a system that will have different net delays and varying amounts of
clock drift between blades and fabrics.

The switch uses a system of synchronized windows where
start information is transmitted around the system. Each transmitter
and receiver can look at relative clock counts from the last
resynch indication to synchronize data from multiple sources. The
receiver will delay the receipt of data which is the first clock
cycle of data in a synch period until a programmable delay after it
receives the global synch indication. At this point, all data is
considered to have been received simultaneously and fixed ordering
is applied. Even though the delays for packet 0 and cell 0 caused
them to be seen at the receivers in different orders due to delays
through the box, the resulting ordering of both streams at receive
time = 1 is the same (Packet 0, Cell 0), based on the physical bus
from which they were received.

Multiple cells or packets can be sent in one counter
tick. All destinations will order all cells from the first
interface before moving onto the second interface and so on. This
cell synchronization technique is used on all cell interfaces.
Differing resolutions are required on some interfaces.
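The fixed ordering rule can be sketched as a sort on the synchronized counter tick and then on the physical interface. This is a simplified model rather than the receiver hardware: the `Arrival` record and its field names are hypothetical, and each datagram is assumed to already carry the tick at which it became eligible relative to the last synch indication.

```python
from typing import NamedTuple


class Arrival(NamedTuple):
    tick: int        # counter tick relative to the last synch indication
    interface: int   # index of the physical bus the datagram arrived on
    seq: int         # position within that interface for the same tick
    name: str        # label used only for this illustration


def fixed_order(arrivals: list[Arrival]) -> list[str]:
    """Order datagrams by tick, then interface, then arrival position, so
    every receiver derives the same order regardless of transport delay."""
    ordered = sorted(arrivals, key=lambda a: (a.tick, a.interface, a.seq))
    return [a.name for a in ordered]
```

Two receivers that see Packet 0 (interface 0) and Cell 0 (interface 1) in opposite wall-clock order still both produce Packet 0 followed by Cell 0, because the result depends only on the tick and the physical bus.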

The Synchronizer consists of two main blocks, namely, the
transmitter and receiver. The transmitter block will reside in the
Striper and Separator ASICs and the receiver block will reside in
the Aggregator and Unstriper ASICs. The receiver in the Aggregator
will handle up to 24 (6 port cards x 4 channels) input lanes. The
receiver in the Unstriper will handle up to 13 (12 fabrics + 1
parity fabric) input lanes.






When a sync pulse is received, the transmitter first
calculates the number of clock cycles it is fast (denoted as N
clocks).

The transmit synchronizer will interrupt the output
stream and transmit N K characters indicating it is locking down.
At the end of the lockdown sequence, the transmitter transmits a K
character indicating that valid data will start on the next clock
cycle. This next cycle valid indication is used by the receivers
to synchronize traffic from all sources. Refer to "K character
usage" on page 34 for the mapping of K characters to the functions.

At the next end of transfer, the transmitter will then 
insert at least one idle on the interface. These idles allow the 
10 bit decoders to correctly resynchronize to the 10 bit serial 
code window if they fall out of synch. 

The receive synchronizer receives the global synch pulse
and delays the synch pulse by a programmed number (which is
programmed based on the maximum amount of transport delay a
physical box can have). After delaying the synch pulse, the
receiver will then consider the clock cycle immediately after the
synch character to be eligible to be received. Data is then
received every clock cycle until the next synch character is seen
on the input stream. This data is not considered to be eligible
for receipt until the delayed global synch pulse is seen.

Since transmitters and receivers will be on different
physical boards and clocked by different oscillators, clock speed
differences will exist between them. To bound the number of clock
cycles between different transmitters and receivers, a global sync
pulse is used at the system level to resynchronize all sequence
counters. Each chip is programmed to ensure that under all valid
clock skews, each transmitter and receiver will think that it is
fast by at least one clock cycle. Each chip then waits for the
appropriate number of clock cycles they are into their current
sync_pulse_window. This ensures that all sources run
N * sync_pulse_window valid clock cycles between synch pulses.

As an example, the synch pulse window could be programmed
to 100 clocks, and the synch pulses sent out at a nominal rate of
a synch pulse every 10,000 clocks. Based on worst case drifts
for both the synch pulse transmitter clocks and the synch pulse
receiver clocks, there may actually be 9,995 to 10,005 clocks at
the receiver for 10,000 clocks on the synch pulse transmitter. In
this case, the synch pulse transmitter would be programmed to send
out synch pulses every 10,006 clock cycles. The 10,006 clocks
guarantees that all receivers must be in their next window. A
receiver with a fast clock may have actually seen 10,012 clocks if
the synch pulse transmitter has a slow clock. Since the synch
pulse was received 12 clock cycles into the synch pulse window, the
chip would delay for 12 clock cycles. Another receiver could see
10,006 clocks and lock down for 6 clock cycles at the end of the
synch pulse window. In both cases, each source ran 10,100 clock
cycles.
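The arithmetic of this example can be reproduced with a small sketch. The helper names are hypothetical, and the model is only one reading of the example: the lockdown length is taken as the number of clocks a chip is into its current window when the synch pulse arrives, and the valid clock count is taken as the end of that window.

```python
def lockdown_cycles(clocks_since_sync: int, window: int) -> int:
    """Clocks the chip is into its current synch pulse window, which is
    the number of cycles it locks down for."""
    return clocks_since_sync % window


def window_end(clocks_since_sync: int, window: int) -> int:
    """Clock count at the end of the current synch pulse window, i.e. the
    next multiple of the window size (ceiling division)."""
    return -(-clocks_since_sync // window) * window
```

With a window of 100 clocks, a receiver that has counted 10,012 clocks locks down for 12 cycles and one that has counted 10,006 clocks locks down for 6 cycles, while `window_end` gives 10,100 for both, matching the figures in the example.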

When a port card or fabric is not present or has just
been inserted and either of them is supposed to be driving the
inputs of a receive synchronizer, the writing of data to the
particular input FIFO will be inhibited since the input clock will
not be present or will be unstable and the status of the data lines
will be unknown. When the port card or fabric is inserted, software
must come in and enable the input to the byte lane to allow data
from that source to be enabled. Writes to the input FIFO will then
be enabled. It is assumed that the enable signal will be asserted
after the data, routeword and clock from the port card or fabric
are stable.

At a system level, there will be a primary and secondary
sync pulse transmitter residing on two separate fabrics. There
will also be a sync pulse receiver on each fabric and blade. This
can be seen in Figure 5. The primary sync pulse transmitter
will be a free-running sync pulse generator and the secondary sync
pulse transmitter will synchronize its sync pulse to the primary.
The sync pulse receivers will receive both primary and secondary
sync pulses and, based on an error checking algorithm, will select
the correct sync pulse to forward on to the ASICs residing on that
board. The sync pulse receiver will guarantee that a sync pulse is
only forwarded to the rest of the board if the sync pulse from the
sync pulse transmitters falls within its own sequence "0" count.
For example, the sync pulse receiver and an Unstriper ASIC will
both reside on the same Blade. The sync pulse receiver and the
receive synchronizer in the Unstriper will be clocked from the same
crystal oscillator, so no clock drift should be present between the
clocks used to increment the internal sequence counters. The
receive synchronizer will require that the sync pulse it receives
will always reside in the "0" count window.

If the sync pulse receiver determines that the primary
sync pulse transmitter is out of sync, it will switch over to the
secondary sync pulse transmitter source. The secondary sync pulse
transmitter will also determine that the primary sync pulse
transmitter is out of sync and will start generating its own sync
pulse independently of the primary sync pulse transmitter. This is
the secondary sync pulse transmitter's primary mode of operation.
If the sync pulse receiver determines that the primary sync pulse
transmitter has become in sync once again, it will switch to the
primary side. The secondary sync pulse transmitter will also
determine that the primary sync pulse transmitter has become in
sync once again and will switch back to a secondary mode. In the
secondary mode, it will sync up its own sync pulse to the primary
sync pulse. The sync pulse receiver will have less tolerance in
its sync pulse filtering mechanism than the secondary sync pulse
transmitter. The sync pulse receiver will switch over more quickly
than the secondary sync pulse transmitter. This is done to ensure
that all receiver synchronizers will have switched over to using
the secondary sync pulse transmitter source before the secondary
sync pulse transmitter switches over to a primary mode.
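
The ordering requirement in the preceding paragraph (receivers switch before the secondary transmitter does) can be illustrated with a small Python sketch. Everything here is hypothetical: the text does not describe the filtering mechanism itself, so the SyncFilter class, the idea of counting consecutive bad primary pulses, and the tolerance values 2 and 4 are assumptions chosen only to show why a smaller tolerance switches first.

```python
from dataclasses import dataclass


@dataclass
class SyncFilter:
    """Hypothetical filter: trips after `tolerance` consecutive bad pulses."""
    tolerance: int
    bad_count: int = 0
    using_primary: bool = True

    def observe(self, primary_in_sync: bool) -> bool:
        """Feed one primary pulse observation; return True while on primary."""
        if primary_in_sync:
            # Simplification: a single good pulse returns to the primary.
            self.bad_count = 0
            self.using_primary = True
        else:
            self.bad_count += 1
            if self.bad_count >= self.tolerance:
                self.using_primary = False
        return self.using_primary


def first_switch(sync_filter: SyncFilter, observations):
    """Index of the first observation at which the filter leaves the primary."""
    for index, in_sync in enumerate(observations):
        if not sync_filter.observe(in_sync):
            return index
    return None


if __name__ == "__main__":
    bad_run = [False] * 5
    receiver_at = first_switch(SyncFilter(tolerance=2), bad_run)
    secondary_at = first_switch(SyncFilter(tolerance=4), bad_run)
    print("receiver switches at pulse", receiver_at)
    print("secondary transmitter switches at pulse", secondary_at)
```

Because the receiver's tolerance is the smaller of the two, it always leaves the primary no later than the secondary transmitter does for the same run of bad pulses.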

Figure 5 shows sync pulse distribution.

In order to lockdown the backplane transmission from a
fabric by the number of clock cycles indicated in the sync
calculation, the entire fabric must effectively freeze for that many
clock cycles to ensure that the same enqueuing and dequeueing
decisions stay in sync. This requires support in each of the
fabric ASICs. Lockdown stops all functionality, including special
functions like queue resynch.

The sync signal from the synch pulse receiver is
distributed to all ASICs. Each fabric ASIC contains a counter in
the core clock domain that counts clock cycles between global sync
pulses. After the sync pulse is received, each ASIC calculates the
number of clock cycles it is fast (8). Because the global sync is
not transferred with its own clock, the calculated lockdown cycle
value may not be the same for all ASICs on the same fabric. This
difference is accounted for by keeping all interface FIFOs at a
depth where they can tolerate the maximum skew of lockdown counts.

Lockdown cycles on all chips are always inserted at the
same logical point relative to the beginning of the last sequence
of "useful" (non-lockdown) cycles. That is, every chip will always
execute the same number of "useful" cycles between lockdown events,
even though the number of lockdown cycles varies.

Lockdown may occur at different times on different chips.
All fabric input FIFOs are initially set up such that lockdown can
occur on either side of the FIFO first without the FIFO running dry
or overflowing. On each chip-chip interface, there is a sync FIFO
to account for lockdown cycles (as well as board trace lengths and
clock skews). The transmitter signals lockdown while it is locked
down. The receiver does not push during indicated cycles, and does
not pop during its own lockdown. The FIFO depth will vary
depending on which chip locks down first, but the variation is
bounded by the maximum number of lockdown cycles. The number of
lockdown cycles a particular chip sees during one global sync
period may vary, but they will all have the same number of useful
cycles. The total number of lockdown cycles each chip on a
particular fabric sees will be the same, within a bounded
tolerance.
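
The behaviour of a sync FIFO across a lockdown can be illustrated with a short cycle-by-cycle simulation. This is a sketch under stated assumptions, not the ASIC implementation: the function simulate_sync_fifo, its arguments, and the pop-before-push ordering within a cycle are all hypothetical. It shows the two rules from the text (no push during cycles the transmitter flags as lockdown, no pop during the receiver's own lockdown) and that the depth excursion is bounded by the number of lockdown cycles.

```python
from collections import deque


def simulate_sync_fifo(initial_depth, tx_lockdown, rx_lockdown, cycles):
    """Simulate one chip-to-chip sync FIFO.

    tx_lockdown / rx_lockdown are sets of cycle indices during which the
    transmitter / receiver is locked down. Returns the popped words and
    the FIFO depth at the end of every cycle.
    """
    fifo = deque(range(initial_depth))   # words already in flight
    next_word = initial_depth
    popped, depths = [], []
    for cycle in range(cycles):
        if cycle not in rx_lockdown:      # receiver does not pop while locked down
            if not fifo:
                raise RuntimeError("FIFO ran dry")
            popped.append(fifo.popleft())
        if cycle not in tx_lockdown:      # flagged lockdown cycles are not pushed
            fifo.append(next_word)
            next_word += 1
        depths.append(len(fifo))
    return popped, depths


if __name__ == "__main__":
    # Transmitter locks down first for 3 cycles, receiver later for 3 cycles.
    popped, depths = simulate_sync_fifo(4, {2, 3, 4}, {6, 7, 8}, 10)
    print("popped words:", popped)
    print("depth per cycle:", depths)
```

In this run the depth dips by three words while the transmitter is locked down and recovers once the receiver has taken its own three lockdown cycles; the words are still popped in their original order.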

The Aggregator core clock domain completely stops for the
lockdown duration - all flops and memory hold their state. Input
FIFOs are allowed to build up. Lockdown bus cycles are inserted in
the output queues. Exactly when the core lockdown is executed is
dictated by when DOUT_AG bus protocol allows lockdown cycles to be
inserted. DOUT_AG lockdown cycles are indicated on the DestID bus.

The memory controller must lockdown all flops for the
appropriate number of cycles. To reduce impact to the silicon area
in the memory controller, a technique called propagated lockdown is
used.

The aggregator signals lockdown cycles on the DIN_ME bus.
The memory controller does not push during these cycles. The
memory controller does not pop during lockdown to account for the
non-push cycles. The FIFO depth is set during fabric
synchronization to tolerate getting deeper or shallower depending
on who locks down first.

Lockdown idle cycles are inserted on the DOUT and CH_ID
busses. An extended sync signal is used to indicate the number of
lockdown cycles on the DOUT_ME bus to aid the Separator's lockdown
function.

The token bus lockdown looks the same as the DIN_ME bus
from a memory controller perspective. Non-push cycles are signaled
by the separators according to their lockdowns. The memory
controller does not pop during lockdown. The Separator locks down
completely in a manner similar to the Aggregator. DIN_SP and CH_ID
lockdown cycles are signaled individually per-bus via the SYNC
signals. Any continuous SYNC assertion after the first one is
considered a lockdown cycle. Lockdown bus cycles are not pushed
into the input FIFOs.

The chip-to-chip communication within a single fabric
must be synchronized. Although no clock drift exists between
chips, differences in track delays cause data to arrive at
different Memory Controllers at different times. All Memory
Controllers need to process incoming packets in exactly the same
logical order on each chip. The Separators must align and combine
multiple data slices coming from different Memory Controllers. The
Memory Controllers must take the tokens received from the
Separators and apply them at exactly the same point in the logical
packet flow, or drop decisions may differ from chip to chip.



The on-fabric chip-to-chip synchronization is executed at
every sync pulse. While some sync error detecting capability may
exist in some of the ASICs, it is the Unstriper's job to detect
fabric synchronization errors and to remove the offending fabric.
The chip-to-chip synchronization is a cascaded function that is
done before any packet flow is enabled on the fabric. The
synchronization flows from the Aggregator to the Memory Controller,
to the Separator, and back to the Memory Controller. After the
system reset, the Aggregators wait for the first global sync
signal. When received, each Aggregator transmits a local sync
command (value 0x2) on the DestID bus to each Memory Controller.

The Memory Controllers do not push anything into a DIN
input FIFO until the first sync command is seen on that bus. The
sync and every bus cycle following is constantly pushed into the
input FIFO. On the core side of the input FIFOs, no FIFO is popped
until a sync appears in the FIFO from every Aggregator. After two
additional margin cycles, every input FIFO is popped every cycle.
After this point the input FIFO depths remain constant. The depths
are roughly a function of the track delays from each Aggregator.
Immediately after the Memory Controllers begin sampling the
Aggregator input FIFOs, a sync signal (S_SYNC_L) is transmitted to
all Separators on the DOUT and CH_ID busses.
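
As a rough illustration of why the input FIFO depths end up as a function of track delay, the following Python sketch computes the steady-state depth of each DIN input FIFO from the cycle at which its sync arrived. It is a simplified model under assumptions that are not in the text: the function name, the bus labels, and the convention that a FIFO gains exactly one entry per cycle from its sync arrival until popping starts are all hypothetical.

```python
def steady_state_depths(sync_arrival_cycle, margin_cycles=2):
    """Depth of each input FIFO once every FIFO is popped every cycle.

    sync_arrival_cycle maps a bus label to the cycle on which the sync
    command was pushed into that bus's input FIFO. Popping starts
    `margin_cycles` after the last sync has arrived; from then on each
    FIFO is pushed and popped once per cycle, so its depth stays fixed.
    """
    if not sync_arrival_cycle:
        raise ValueError("at least one input bus is required")
    pop_start = max(sync_arrival_cycle.values()) + margin_cycles
    return {bus: pop_start - arrival
            for bus, arrival in sync_arrival_cycle.items()}


if __name__ == "__main__":
    # Hypothetical arrival cycles for three Aggregator busses.
    arrivals = {"agg0": 3, "agg1": 5, "agg2": 4}
    print(steady_state_depths(arrivals))
```

The bus whose sync arrives last ends up with the shallowest FIFO (just the margin), and busses with shorter track delays hold correspondingly more entries.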

Like the Memory Controllers, the Separators do not push
into the DIN and CH_ID input FIFOs until a sync signal is received
on that bus. The sync and everything after is constantly pushed
into the input FIFO.

On the core side the Separator always waits until at
least one word is present on all input busses, and then pops the
CH_ID and DIN busses simultaneously. This will logically align the
data stripes coming from the Memory Controllers. After the first
combined sync is popped from the input FIFOs, the Separators send
a sync signal on the TOKEN bus to the Memory Controllers.

The Memory Controllers do not push into the TOKEN bus
input FIFO until a sync signal (0x3F on the token bus) has been
seen on the bus. The sync and all subsequent tokens and idles are
always pushed.

All Memory Controllers need to apply the received tokens
to the same point in the incoming logical flow in order for all
drop decisions to be identical. This is done by waiting a worst
case number of clock cycles after the Separator sync transmission
before beginning to pop the token input FIFO. The worst case delay
must be used because there is no way for a single Memory Controller
to know exactly when all other Memory Controllers have received a
token. The programmable delay stored in the 16 bit Token Sync Wait
Register is in "useful" cycles (125MHz) that do not include the
fabric lockdown cycles. The worst case delay is the worst case
skew for all data paths going from the Aggregator to Memory
Controller to Separator and back to Memory Controller.

The following Table 10 gives the min/max delays which the
chipset supports and represents the limits of what is verified in
the chip verification process.

Sync pulse transport delay from Transmitter to any
individual chip receiving the sync pulse (WC path - BC path): 500
nS (min delay of 0 to max delay of 500 nS). At 175 ps/inch, this
works out to a difference of about 70m. Backplane transport delay
difference from local sync pulse receipt to reception of the sync
indication flag by the far end chips: [illegible] nS. Note that it
is desired to allot about [illegible] nS of this to the chip
synchronizer operation which gives a delta path delay supported of
500 nS.

Oscillators should be [illegible] ppm oscillators. The
assumption of the design was that the difference in transmission
path delay was less than or equal to clock drift. On board delays
between chips have been designed to exceed the following specs:

Shortest net: 0.25", transport delay of pretty much 0.
Longest net: 25", transport delay is 5 nS.

For any signal distribution, the net delta delay between
chips is a multiplier of the number of busses the sync has
traversed. Since the sync goes through a receive synchronization to
the local clock of the chip, a +/- 8 nS uncertainty has to be
added at each stage giving a net uncertainty of around 21 nS for
each hop.



TABLE 10: Fabric sync delay

Chip                         Number of busses   Skew          Notes
[illegible]                  [illegible]        [illegible]   Sync pulse in
Memory controller DIN        1                  [illegible]   Sync pulse to agg + agg_mc delta
Sep DIN                      2                  [illegible]   Sync pulse to agg + agg_mc + mc_sep (note this sync pulse is delayed by the memory controller for propagated lockdown).
Memory controller token in   [illegible]        [illegible]   Everything above + sep_mc tokens.



The control port follows the same cell flow as the
regular ports. The switch control processor sends cells to the
striper ASIC; the striper stripes the cells and route words across
all fabrics. An additional aggregator (9th) ASIC sends cells via
the DOUT_AG/DestID buses to all 12 memory controllers. Each memory
controller ASIC has an additional 9th DIN_ME_fb_se_9 bus.

The memory controller ASIC will route the incoming
control port cells to any one of the control port destination
queues and blade queues (up to 196 queues). The 9th DOUT_ME_fb_se_9
bus is used to send the control cells to the 9th separator ASIC,
which sends the cells to one of several destination unstriper
ASICs. The unstriper ASIC reconstructs the cells from all 9th
separator ASICs across all fabrics. It sends the complete control
cells to the switch control processor it is connected to.

Note that the control port destination queues can be part
of any multicast cells such that the multicast port mask is
necessary to include additional bit(s) to indicate the control port
queue(s).

There are at most 4 control ports in any switch
configuration. This limitation is due to the aggregator and
separator ASICs having only 4 12 bit channels which can be scalable
to different switch configurations, respectively. In other words,
bus DIN_AG_fb_9_1_1, DIN_AG_fb_9_2_1, DIN_AG_fb_9_3_1, and
DIN_AG_fb_9_4_1 of the aggregator ASIC are connected to up to 4
control port striper ASICs. Bus DOUT_SP_fb_9_1_1, DOUT_SP_fb_9_2_1,
DOUT_SP_fb_9_3_1, and DOUT_SP_fb_9_4_1 of the separator ASIC are
connected to up to 4 control port unstriper ASICs.



The striping function assigns bits from incoming data
streams to individual fabrics. Two items were optimized in deriving
the striping assignment:

1. Backplane efficiency should be optimized for OC48
and OC192.

2. Backplane interconnection should not be
significantly altered for OC192 operation.

These were traded off against additional muxing legs for
the striper and unstriper ASICs. Regardless of the optimization,
the switch must have the same data format in the memory controller
for both OC48 and OC192.

Backplane efficiency requires that minimal padding be
added when forming the backplane busses. Given the 12 bit backplane
bus for OC48 and the 48 bit backplane bus for OC192, an optimal
assignment requires that the number of unused bits for a transfer
be equal to (number_of_bytes * 8) / bus_width, where "/" is integer
division. For OC48, the bus can have 0, 4 or 8 unutilized bits. For
OC192 the bus can have 0, 8, 16, 24, 32, or 40 unutilized bits.
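
The listed unutilized-bit counts can be reproduced with a short calculation. The sketch below is one reading of the paragraph rather than the document's own formula: it assumes the unused bits are simply the padding needed to round number_of_bytes * 8 up to the next multiple of the bus width. The function name unused_bits is hypothetical.

```python
def unused_bits(number_of_bytes: int, bus_width: int) -> int:
    """Padding bits needed to fill the last bus transfer of a packet.

    Assumption: the packet occupies number_of_bytes * 8 bits and the
    final transfer is padded up to a whole multiple of bus_width.
    """
    if number_of_bytes < 0 or bus_width <= 0:
        raise ValueError("invalid packet length or bus width")
    return (-number_of_bytes * 8) % bus_width


if __name__ == "__main__":
    oc48 = sorted({unused_bits(n, 12) for n in range(1, 200)})
    oc192 = sorted({unused_bits(n, 48) for n in range(1, 200)})
    print("OC48 (12 bit bus):", oc48)
    print("OC192 (48 bit bus):", oc192)
```

Under this reading, every byte length yields one of the padding values named in the text for the corresponding bus width.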

This means that no bit can shift between 12 bit 
boundaries or else OC48 padding will not be optimal for certain 
packet lengths. 

For OC192c, maximum bandwidth utilization means that each
striper must receive the same number of bits (which implies bit
interleaving into the stripers). When combined with the same
backplane interconnection, this implies that in OC192c, each stripe
must have exactly the correct number of bits coming from each
striper, each of which has 1/4 of the bits.

For the purpose of assigning data bits to fabrics, a 48
bit frame is used. Inside the striper is a FIFO which is written 32
bits wide at 80-100 MHz and read 24 bits wide at 125 MHz. Three 32
bit words will yield four 24 bit words. Each pair of 24 bit words
is treated as a 48 bit frame. The assignments between bits and
fabrics depend on the number of fabrics.
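
The width conversion performed by the striper FIFO (three 32 bit words in, four 24 bit words out) can be sketched as follows. This is an illustration only: the function name is hypothetical, and the most-significant-bits-first ordering is an assumption, since the text does not say in which order the bits of the 32 bit words are consumed.

```python
def repack_32_to_24(words32):
    """Regroup 32 bit words into 24 bit words, three in and four out.

    Assumption: bits are consumed most significant first. The real FIFO
    ordering is not specified in the text.
    """
    if len(words32) % 3 != 0:
        raise ValueError("need a multiple of three 32 bit words")
    words24 = []
    for start in range(0, len(words32), 3):
        block = 0
        for word in words32[start:start + 3]:
            block = (block << 32) | (word & 0xFFFFFFFF)   # build a 96 bit block
        for shift in (72, 48, 24, 0):
            words24.append((block >> shift) & 0xFFFFFF)   # slice into 24 bit words
    return words24


if __name__ == "__main__":
    out = repack_32_to_24([0x11223344, 0x55667788, 0x99AABBCC])
    print([hex(w) for w in out])
```

Each consecutive pair of the resulting 24 bit words would then form one 48 bit frame.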



TABLE 11: Bit striping function

The following tables give the byte lanes which are read
first in the aggregator and written to first in the separator. The
four channels are notated A, B, C, D. The different fabrics have
different read/write order of the channels to allow for all busses
to be fully utilized.



One fabric-40G

The next table gives the interface read order for the
aggregator.

Fabric   1st   2nd   3rd   4th
0        A     B     C     D
Par      A     B     C     D


Two fabric-80G

Fabric   1st   2nd   3rd   4th
0        A     C     B     D
1        B     D     A     C
Par      A     C     B     D


Three fabric-120G

Fabric   1st   2nd   3rd   4th
0        A     D     B     C
1        C     A     D     B
2        B     C     A     D
Par      A     D     B     C



Four fabric-160G

Fabric   1st   2nd   3rd   4th
0        A     B     C     D
1        D     A     B     C
2        C     D     A     B
3        B     C     D     A
Par      A     B     C     D



Six fabric-240G

Fabric   1st   2nd   3rd   4th
0        A     D     C     B
1        B     A     D     C
2        B     A     D     C
3        C     B     A     D
4        D     C     B     A
5        D     C     B     A
Par      A     C     D     B



Twelve fabric-480G

Fabric    1st   2nd   3rd   4th
0,1,2     A     D     C     B
3,4,5     B     A     D     C
6,7,8     C     B     A     D
9,10,11   D     C     B     A
Par       A     B     C     D



Interfaces to the gigabit transceivers will utilize the
transceiver bus as a split bus with two separate routeword and data
busses. The routeword bus will be a fixed size (2 bits for OC48
ingress, 4 bits for OC48 egress, 8 bits for OC192 ingress and 16
bits for OC192 egress); the data bus is a variable sized bus. The
transmit order will always have routeword bits at fixed locations.
Every striping configuration has one transceiver that is used to
talk to a destination in all valid configurations. That
transceiver will be used to send both routeword busses and to start
sending the data.

The backplane interface is physically implemented using
125 MHz interfaces to the backplane transceivers. The 125 MHz bus
for both ingress and egress is viewed as being composed of two
halves, each with routeword data. The two bus halves may have
information on separate packets if the first bus half ends a
packet.

For example, an OC48 interface going to the fabrics
locally speaking has 24 data bits and 2 routeword bits @ 125 MHz.
This bus will be utilized acting as if it has 2 x (12 bit data bus
+ 1 bit routeword bus). The two bus halves are referred to as A
and B. Bus A is the first data, followed by bus B. A packet can
start on either bus A or B and end on either bus A or B.

In mapping data bits and routeword bits to transceiver
bits, the bus bits are interleaved. This ensures that all
transceivers should have the same valid/invalid status, even if the
striping amount changes. Routewords should be interpreted with bus
A appearing before bus B.

The bus A/Bus B concept closely corresponds to having
[illegible] interfaces between chips.

All backplane busses support fragmentation of data. The
protocol used marks the last transfer (via the final segment bit in
the routeword). All transfers which are not final segment need to
utilize the entire bus width, even if that is not an even number of
bytes. Any given packet must be striped to the same number of
fabrics for all transfers of that packet. If the striping amount
is updated in the striper during transmission of a packet, it will
only update the striping at the beginning of the next packet.

Each transmitter on the ASICs will have the following I/O
for each channel:

8 bit data bus, 1 bit clock, 1 bit control.

On the receive side, for each channel the ASIC receives
a receive clock, 8 bit data bus, 3 bit status bus.

The switch optimizes the transceivers by mapping a
transmitter to between 1 and 3 backplane pairs and each receiver
with between 1 and 3 backplane pairs. This allows only enough
transmitters to support traffic needed in a configuration to be
populated on the board while maintaining a complete set of
backplane nets. The motivation for this optimization was to reduce
the number of transceivers needed.

The optimization was done while still requiring that at
any time, two different striping amounts must be supported in the
gigabit transceivers. This allows traffic to be enqueued from a
striper striping data to one fabric and a striper striping data to
two fabrics at the same time.

In all modes of operation, the entire 3.0G of data is
always supported on switch ingress. For egress operation, for 40G
and 80G, the number of transceivers needed to support a full 2x
speedup was deemed too expensive. For these switch modes, the
output speedup is between 1.0 and 2. All configurations above 80G
support a full 2x speedup.

Depending on the bus configuration, multiple channels may
need to be concatenated together to form one larger bandwidth pipe
(any time there is more than one transceiver in a logical
connection). Although quad gbit transceivers can tie 4 channels
together, this functionality is not used. Instead the receiving
ASIC is responsible for synchronizing between the channels from one
source. This is done in the same context as the generic
synchronization algorithm.

The 8b/10b encoding/decoding in the gigabit transceivers
allows a number of control events to be sent over the channel. The
notation for these control events is K characters and they are
numbered based on the encoded 10 bit value. Several of these K
characters are used in the chipset. The K characters used and
their functions are given in the table below.



TABLE 12: K Character usage

K character   Function          Notes
28.0          Sync indication   Transmitted after lockdown cycles, treated as the prime synchronization event at the receivers
28.1          Lockdown          Transmitted during lockdown cycles on the backplane
28.2          Packet Abort      Transmitted to indicate the card is unable to finish the current packet. Current use is limited to a port card being pulled while transmitting traffic
28.3          Resync window     Transmitted by the striper at the start of a synch window if a resynch will be contained in the current sync window
28.4          BP set            Transmitted by the striper if the bus is currently idle and the value of the bp bit must be set.
28.5          Idle              Indicates idle condition
28.6          BP clr            Transmitted by the striper if the bus is currently idle and the bp bit must be cleared.

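The K character assignments of Table 12 lend themselves to a small lookup structure. The following Python sketch is illustrative only: the enumeration KChar, its member names, and the helper bp_after are hypothetical, and the table entries "28.x" are assumed to denote the 8b/10b special characters K28.x.

```python
from enum import Enum


class KChar(Enum):
    """K characters used by the chipset, per Table 12 (names are hypothetical)."""
    SYNC = "K28.0"            # sent after lockdown cycles, prime sync event
    LOCKDOWN = "K28.1"        # sent during lockdown cycles on the backplane
    PACKET_ABORT = "K28.2"    # card cannot finish the current packet
    RESYNC_WINDOW = "K28.3"   # a resynch will occur in the current sync window
    BP_SET = "K28.4"          # bus idle, bp bit must be set
    IDLE = "K28.5"            # idle condition
    BP_CLR = "K28.6"          # bus idle, bp bit must be cleared


def bp_after(current_bp: bool, received: KChar) -> bool:
    """Value of the bp bit after a K character is received on an idle bus."""
    if received is KChar.BP_SET:
        return True
    if received is KChar.BP_CLR:
        return False
    return current_bp          # all other K characters leave bp unchanged


if __name__ == "__main__":
    bp = False
    for k in (KChar.IDLE, KChar.BP_SET, KChar.IDLE, KChar.BP_CLR):
        bp = bp_after(bp, k)
        print(k.value, "-> bp =", bp)
```

Keeping the three idle-bus characters distinct in this way reflects the table's description that the bp bit can be conveyed even when no data is being transferred.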
The switch has a variable number of data bits supported
to each backplane channel depending on the striping configuration
for a packet. Within a set of transceivers, data is filled in the
following order:

F[fabric]_[oc192 port number][oc48 port designation
(a,b,c,d)][transceiver_number]

Everything in the documentation is done for fabric-1,
which is the case where all connections are needed. The only part
of this which is used for fill order is transceiver_number (OC48)
and transceiver number and oc48 port designation for OC192.

The fundamental rules for mapping are the following:

1. [illegible] & RW are on transceiver 1. These always occupy the
first 4 bits of the transceiver.

2. Data bits starting with the least significant bit are filled
into the data bus in a 2 bit bit interleaved pattern, with bus A
and bus B pairs.

3. Transceivers are filled in starting at bit 0 of their transmit
and receive interfaces.

4. All multibit routeword fields are transmitted LSB to MSB. This
includes connection number, number of fabrics and encoded values of
stop/align/final segment. The overall routeword is notated as
starting from bit 0 (least significant bit) and up. Transmit order
is Bit 0 (SOP) goes on the first routeword bit, followed by bit 1
(Packet type). If multiple routeword bits are transmitted in the
same clock they are filled in starting with the first bit going to
bit 0, the second bit going to bit 1.

5. Data should be encoded and decoded based on a bus A/Bus B
order.

6. For OC192, the fill order should be bus [illegible] for
routeword bits. For data bits, the fill order depends on
wacking/unwacking/reverse unwacking and reverse wacking functions.

Transceiver 1

For an ingress bus, the format of data is the following:

Bit 0   [illegible]
Bit 1   [illegible]
Bit 2   RWA
Bit 3   RWB
Bit 4   Dataa (0)
Bit 5   Dataa (1)
Bit 6   Datab (0)
Bit 7   Datab (1)

Note that for 12 fabric mode, bits 5 and 7 are unused.
The location of datab (0) does not change.

For the egress bus, the format of the data is the
following:

Bit 0   RWA (0)
Bit 1   RWA (1)
Bit 2   RWB (0)
Bit 3   RWB (1)
Bit 4   Dataa (0)
Bit 5   Dataa (1)
Bit 6   Datab (0)
Bit 7   Datab (1)



Transceiver 2 and up

Fill up the data bus starting at each transceiver bit 0
to bit 7 with 2 bit interleaved dataa/datab patterns.

For example, transceiver 2 has the following pattern:

Bit 0   Dataa (2)
Bit 1   Dataa (3)
Bit 2   Datab (2)
Bit 3   Datab (3)
Bit 4   Dataa (4)
Bit 5   Dataa (5)
Bit 6   Datab (4)
Bit 7   Datab (5)
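
The 2 bit interleaved fill pattern can be generated programmatically. The sketch below is a hypothetical helper, not part of the chipset documentation: the function name transceiver_data_map and the "reserved" label for the first four bits of transceiver 1 are assumptions, and the pattern for transceiver 3 and beyond is extrapolated from the transceiver 1 and transceiver 2 layouts given above.

```python
def transceiver_data_map(transceiver_number: int):
    """Return labels for bits 0..7 of a transceiver (numbered from 1).

    Transceiver 1 carries non-data bits in its first four positions and
    data in bits 4..7; later transceivers carry data in all eight bits.
    Data is filled in pairs: two bus A bits, then the matching bus B bits.
    """
    if transceiver_number < 1:
        raise ValueError("transceivers are numbered from 1")
    if transceiver_number == 1:
        labels = ["reserved"] * 4        # routeword and other non-data bits
        groups, index = 1, 0
    else:
        labels = []
        groups = 2
        index = 2 + 4 * (transceiver_number - 2)   # extrapolated start index
    for _ in range(groups):
        labels += [f"dataa({index})", f"dataa({index + 1})",
                   f"datab({index})", f"datab({index + 1})"]
        index += 2
    return labels


if __name__ == "__main__":
    for number in (1, 2):
        print(number, transceiver_data_map(number))
```

For transceivers 1 and 2 the generated labels match the layouts listed in the text; higher transceiver numbers follow the same pattern only under the stated extrapolation.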



The stop/align encoding depends on the width of the bus interface.

TABLE 13: OC48 portcard to fabric routeword stop/align

Field:    Stop/Align
Length:   2 + n (where n is the number of clock cycles of transfer)
Function: In this mode, this field is stop & align & final_segment.

Stop bit is a 1 to indicate no stop, zero indicates stop. Stop bits repeat in a serial stream until a stop bit of zero is seen, followed by the align bit and FS. Since stop is followed by the align and FS bits, the stop bit is given 2 clock cycles before the end of data.

Align bit is a one to indicate valid data on the last complete byte on the interface. For odd 12 bit words (assuming zero based counting), align = 0 indicates bits 0:3 are valid, and bits 4:11 are invalid. Align = 1 for these words indicates that all 12 bits are valid. For even words, align should normally be a 1.

Final segment is a one to indicate a final segment of a packet and a zero to indicate a partial segment of a packet. Only one packet can be in transit at any one time on this bus. This bit is only [illegible]











TABLE 14: OC192 portcard to fabric routeword stop/align

Field:    Stop/Align
Length:   3 + 4 * number of extra clocks
Function: Due to length restrictions on this bus, the stop/align has to be treated differently than for OC48 transfers.

The first clock cycle, this field is 3 bits long and is notated as SAF0. In all future clock cycles the stop field is 4 bits long and notated SAF1. The definitions of SAF0 and SAF1 are given below.

SAF0(0): Bit zero is a zero to indicate a stop, a one to indicate no stop.
SAF0(2:1): "00" indicates full word transfer.
           "01" indicates a full word transfer but for a short packet.
           "10" indicates a full word transfer but not the final segment.
           "11" is reserved.

SAF1(0): Bit zero is a zero to indicate a stop, a one to indicate no stop on the current cycle.

SAF1(3:1): binary value of the number of valid bytes. Zero is reserved and 7 is used to indicate 6 bytes valid but not the final segment. 6 indicates 6 bytes valid and final segment. All partial word transfers automatically indicate an implied final segment.



TABLE 15: OC48 Fabric-Port card routeword stop/align

Field:    Stop/Align
Length:   3 + 2 * [illegible]
Function: Value is treated as a repeated 2 bit value (encoded stop) followed by the final segment bit.

Stop field is interpreted as:
00 - 1st byte finished is valid and stop
01 - 2nd byte finished is valid and stop
10 - 3rd byte finished is valid and stop, or non-final segment.
Short packets are indicated by flagging a stop at byte 53.

Final segment is a one for a final segment, a zero for a continuing packet. For final segments, the stop field should be encoded as a "10".



The port card - fabric interface at OC192 variable routeword bits are given in the table below.

TABLE 16: OC192 Fabric-port card routeword stop/align

Field:    Stop/Align
Length:   7 + 8 * number [illegible] transfer
Function: Bit 0 indicates stop. Zero indicates stop, 1 continue. [illegible]

Values 0xC, 0xF are reserved. Any non-12 byte ending offset automatically signals end of segment [illegible] cycle of data.

Short packets are indicated by flagging a stop at byte 53.









Depending — em — the — switch configuration, — the bus may not 

transfer — an — integer — numb e r — erf — bytes . This — i-s — handled — by — the- 

interface always flagging the bytes which finish and the transmit 
wrd — receive — state — machines — must — track where — bytes — begin — and — end 
4 5 based on the current cycle in the transfer. 

"Pfre — btrs — consists — erf — a — multiplexed — address/data — fotrs- 

(AD_DATA) , a select signal (ADJjEL_L) , a read/write signal (AD_RW) , 
and a bus transaction complete indication signal (AD_RDY_L) . AD bus 
irs — used for — read/write access of control/status — registers . 

In order to write to a control/status register, the
read/write signal (AD_RW) must be low. The select signal (AD_SEL_L)
must be asserted low for the entire duration of the access, and
values must be placed on the AD_DATA bus in the following sequence
(cycle 0 is the first cycle where AD_SEL_L is low for this
transaction):

cycle 2-5: Data to be written to control/status register. For
registers that are wider than 8 bits (maximum of 32 bits),
write data must be presented one byte per cycle starting with
the LSB. Any data presented on the bus beyond the width of the
register will be ignored.

cycles > 5: ASIC will assert AD_RDY_L on completion of the
write access, and will keep it asserted until AD_SEL_L is
de-asserted.

Figure 12 shows a Write Cycle. 

In order to read from a control/status register, the
read/write signal (AD_RW) must be high. The select signal
(AD_SEL_L) must be asserted low for the entire duration of the
access, and values must be placed on the AD_DATA bus in the
following sequence (cycle 0 is the first cycle where AD_SEL_L is
low for this transaction):

cycle 0-1: Address of control/status register

cycle 2: AD_DATA bus should be released (hi-z)

subsequent cycles: When the data is available, ASIC will drive the
read data onto the bus, one byte per cycle for four cycles,
along with assertion of AD_RDY_L signal. For registers smaller
than 32 bits wide, unused bits are presented as zeros. The LSB
is present on the bus during the 1st clock cycle of AD_RDY_L
assertion.

Figure 13 shows a Read Cycle.
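The byte ordering used on the AD_DATA bus can be illustrated with a minimal sketch. It models only the data phase described above (one byte per cycle, LSB first, at most 32 bits). The function names are hypothetical and are not part of the chipset; the sketch also assumes register widths that are a multiple of 8 bits.

```python
def split_write_data(value):
    """Return the four data bytes of a write access, LSB first."""
    return [(value >> (8 * i)) & 0xFF for i in range(4)]


def apply_write(byte_seq, reg_width_bits):
    """Model the register side of a write.

    Bytes are reassembled LSB first; anything beyond the register
    width is ignored, as the text states.
    """
    value = 0
    for i, b in enumerate(byte_seq[:4]):
        value |= (b & 0xFF) << (8 * i)
    return value & ((1 << reg_width_bits) - 1)


def assemble_read_data(byte_seq):
    """Reassemble the four bytes driven during a read, LSB first.

    For registers narrower than 32 bits the unused bytes are zeros.
    """
    value = 0
    for i, b in enumerate(byte_seq[:4]):
        value |= (b & 0xFF) << (8 * i)
    return value
```

A 16-bit register written with 0x12345678 would, under these assumptions, keep only 0x5678.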

The switch chips will generate interrupts on error
conditions. The interrupt lines have the following
characteristics:

1. Level Sensitive

2. Active Low

3. Asynchronous (no clock generated to go along with the
interrupt).

4. Assume point-to-point interconnection with board logic which
combines together interrupts.

Interrupts are maskable on a condition by condition basis
inside each chip. The interrupt signal is asserted on the
occurrence of an error condition and is cleared when the error
condition is cleared. Any temporary conditions which caused an
interrupt are recorded in the chip so no phantom interrupts should
be seen.

The reality of the switch is that errors will occur. The
intent in the following is to detail the expected system behavior
and recovery strategy needed for each error type.



TABLE 17: Error recovery in the ASICs



Error 






Hardware comments 












HUill UIIW IUUIIV 










Stuck bit between agg & 


unstriper sees data corruption


memory controller 


from one fabric, either route 
word or data. 
















word or data 


Stuck bit on fabric egress
























Soft-fail on route word from 


At least two unstripers see either


Queue resynch


Worst case scenario involves


port card 


a routeword mismatch, a state


Failing routcword with different 


Willi U lllgll IIUII1UV1 Ul 1UUIVVVUIU 


IUUIIV IUUIVVVUIUJ l\J 


1 1 UOIUUIVIIV/O, V I UULU pUIILJf HIUIC 


i lj iv j vjuvuving u puvrv^i l\J 11 IV. 


Ui ally IIUIIiUvi Ul uiimiijjvio Will 


WlUIIg JJUI I \J1 UIV[ypIMj3 UIv 


3vC™a" TOulvWUlU 1 1 1I3II1UHII, a 


UlIIIH MI UIL LL^UILFI villi 


high number of routcword mis 


cause an — impact to all ports. 


matches or data parity errors and 




Probability of impacting more 


an aggregator will see a synch
error




ports goes up with traffic load
and memory utilization in
memory controllers.






None 




Soft-fail on data from port- 
card 


Unstriper sees one time error,
probability of automatic hardware
based data recovery is high












Soft-fail between agg/memory 


At least two unstripers see either


Queue resynch


controller dest_id bus


















controller data bus 


probability of automatic hard 












At least two unstripers see either 




Tokens get out of synch. May


bus 


a routcword mismatch, a state 










the — separator, — depending — on 








controller/separator data bus for 


Packet boundaries from one
separator port are lost. Unstriper


Queue Resynch


Inherent that no self-stabilige in 
occurs w/o queue resyneh. 
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RW data 


affected aggregator output- 










M 




soft-fail between memory 
paeket data 


Single port sees one-time error. 




Mismatches from fabric due to 






soft-fail on token bus from 




Queue Resynch 


separator to memory controller 


differences in separator
scheduling.






Reset 




soft-fail internal to fabric chips 


Unstriper sees different traffic




Queue Resynch may fix the


from fabric than other fabrics


problem, reset is necessary for 
restoring state. 






















Replace faulty hardware. 




plane idle to synchronize to nv 


indicating it has seen baek plane 


synch 


Aggregator — never sets — fttrg 






indicating it has seen back plane 

3ync 




Lit V IVpi/ltlllg piUUlCIUJ ^lUJl UUlll 










1 1 IVi 1 LU) J tUI 1 LI Wl 1^1 UVJVJ IJVSl O W 


IVVU) ibOTUWII \It 11 L/bl 1 1 IUI l^-l 1 1 










separator does not see synch 


Separator — never — gets — initial 
synch 




unstriper does not sec back 


Unstriper never gets baek plane 
synch 


replace faulty hardware 








Initialize the hardware 




fabric chips not initialized 


Chips do not do anything 




Fault can be caused by failure of
the on-board processor, if soft-
fail, watchdog should catch it.












plane 












Unstriper not initialized


All incoming data ignored


Initialize unstriper 










Jlll|JL aillUUIIl llUJUITXvt 


{Jl lUIUUlg UUlu u uiuli|jvu in 






striper, interrupt asserted 


a disagreement between the
stripe amount and the
configuration register for the
switch operating mode.
































Secondary syne pulse TX 
failure 


Synch pulse receiver on all 




If leaving reset, no chips on 
board get in syne. — If during 


Replace board with bad synch 
pulse receiver 
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operation, should see a syneh 
error citncr in an aggregator oi 
utt unstriper icq oy mis niucrc. 




source. 




None 








tf — any — FIFOs — overflow — m 


internal to the board 


i. 

rcsynen 










I lard failure on sync pulse 






distribution to a single chip on 

ii Fabric 


cfrrp: — Additionally, if data is 

report data corruption from the 
associated fabric. 












unstripcr-May sec what looks 


Reset port card 


game as below. 


distribution to a single chip on 






before the others. 




M 






soft failure on syne pulse 


If no FIFO overflow, none. — H 


Striper — missing — synch — ptrtec 


distribution to a single chip on 


FIFO overflow, need to reset 


could overflow a TIFO on every 






be done serially and switch 
could be effectively down by 

311 IUUIICoUU lllv 331111. LIlIHg 13 IVJ 

the striper and would require a 
port card reset to recover. 






Reset the fabric 




on a fabric 


unknown 












soft failure on syne pulse 






Same as single-failure. 











The chipset implements certain functions which are
described here. Most of the functions mentioned here have support
in multiple ASICs, so documenting them on an ASIC by ASIC basis
does not give a clear understanding of the full scope of the
functions required.

The switch chipset is architected to work with packets up
to 64K + 6 bytes long. On the ingress side of the switch, there
are buses which are shared between multiple ports. Most packets
are transmitted without any break from the start of packet to the
end of packet. However, this approach can lead to large delay
variations for delay sensitive traffic. To allow delay sensitive
traffic and long traffic to coexist on the same switch fabric, the
concept of long packets is introduced. Basically, long packets
allow chunks of data to be sent to the queueing location, built up
at the queueing location on a source basis and then added into the
queue all at once when the end of the long packet is transferred.
The definition of a long packet is based on the number of bits on
each fabric. The following table gives the size of long packets
for different switch sizes.

TABLE 18: Long Packet Sizes

Switch Size    Packet Size (bytes)

40G            900
80G            1800
120G           2700
160G           3600
240G           5400
480G           10800


If the switch is running in an environment where Ethernet
MTU is maintained throughout the network, long packets will not be
seen in a switch greater than 40G in size.
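The relation between switch size and long packet size can be sketched as follows. The sketch assumes that a long packet segment on one fabric is 12 x 3 x 200 bits (12 memory controllers, 3 cache lines of 200 bits each, the figure used in the aggregator dequeue description) and that each fabric contributes 40G of switch capacity. The function name and constants are illustrative only.

```python
MEMORY_CONTROLLERS = 12   # memory controllers per fabric
CACHE_LINE_BITS = 200     # width of one shared-memory entry
MIN_LONG_LINES = 3        # assumed minimum long packet length in cache lines
ETHERNET_MTU_BYTES = 1500


def long_packet_bytes(fabrics):
    """Packet size (bytes) at which a packet is treated as long.

    Each fabric carries 1/fabrics of the packet, so the threshold
    scales with the number of fabrics.
    """
    per_fabric_bits = MEMORY_CONTROLLERS * MIN_LONG_LINES * CACHE_LINE_BITS
    return fabrics * per_fabric_bits // 8


def mtu_packets_can_be_long(fabrics):
    """True if an Ethernet-MTU packet can reach the long packet size."""
    return ETHERNET_MTU_BYTES >= long_packet_bytes(fabrics)
```

Under these assumptions only the single-fabric (40G) switch sees long packets when the Ethernet MTU is maintained, which agrees with the statement above.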

A wide cache-line shared memory technique is used to
store cells/packets in the port/priority queues. The shared memory
is 8K entries x 200 bit wide running at 125MHz. Each memory
controller ASIC yields 25Gbps memory bandwidth. The aggregator #9
(control port) generates at most 4 streams of OC-48 traffic. The
enqueue and dequeue speed for different switch configurations is
shown in the following table. Note that a 2x speedup can be
achieved for all switch configurations except the 480G switch. Up to
234,057 cells can be stored in the 480G switch. The shared memory
stores cells/packets continuously so that there is virtually no
fragmentation and bandwidth waste in the shared memory.
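The capacity and bandwidth figures in this paragraph can be cross-checked with a short calculation. The sketch assumes 8K means 8192 entries and that the shortest cell occupies 7 bits in each memory controller of the largest switch; the names are illustrative.

```python
ENTRIES = 8 * 1024        # assumed: 8K = 8192 entries
ENTRY_BITS = 200          # width of one entry
CLOCK_HZ = 125_000_000    # 125MHz

# Total shared memory per memory controller, in bits.
capacity_bits = ENTRIES * ENTRY_BITS

# Raw memory bandwidth: one 200-bit entry per clock.
bandwidth_bps = ENTRY_BITS * CLOCK_HZ


def cells_stored(cell_bits):
    """Number of whole cells of the given per-controller length that fit."""
    return capacity_bits // cell_bits
```

With a 7-bit cell this gives 234,057 cells, and the capacity works out to 1,638,400 bits at 25Gbps.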



For the short packets/cells, memory utilization can be
close to 100%. For the long packets, the memory block before the
start of a long packet can be almost completely wasted. The
minimum length for a long packet is 3 cache lines, giving an
effective utilization of memory close to 75% since 1 out of 4
memory cache lines can be wasted.

TABLE 19: Shared Memory (1,638,400 bits) in Each Memory Controller



S wit ches 



n i 



Enqueue De q ueu e Speedup Cell Leng th 



n i 



Rstto 



Num b er of 

ri-ii- 

VII J 



TUVJ 



4.3Gb p s 



20.7Gbps 



391 1 bi t s 



JGb ps 



3Gb p s 



21 1 1 bi t s 



74 , 472 



I LKJKJ 
2 0 1 

a c*f\r* 

TUUVI 



S.OGb p s 

5.3Gb p s 

mi 

9.4Gb p s 



20Gb p s 
9.7Gb p s 
OGb p s 
5, OGb p s 



15+1 bits
12+1 bits
9+1 bits
6+1 bits



102,400
126,030
163,840
234,057



There exist multiple queues in the shared
memory. They are per-destination and priority based. All
cells/packets which have the same output priority and blade/channel
ID are stored in the same queue. Cells are always dequeued from
the head of the queue and enqueued into the tail of the queue. Each
cell/packet consists of a portion of the egress route word, a
packet length, and variable-length packet data. Cells and packets
are stored continuously, i.e., the memory controller itself does
not recognize the boundaries of cells/packets for the unicast
connections. The packet length is stored for MC packets. There is
a limitation of 4K packets (or cells) in each of the MC queues.

The multicast port mask memory 64Kx16-bit is used to
store the destination port mask for the multicast connections, one
entry (or multiple entries) per multicast VC. The port masks of the
head multicast connections indicated by the multicast DestID FIFOs
are stored internally for the scheduling reference. The port mask
memory is retrieved when the port mask of the head connection is
cleaned and a new head connection is provided.

Two configurations of port mask memory are supported:

a. 8K port connections, for a 240G switch

b. 4K connections, for a 480G switch.



Dequeue performance is restricted by several factors: 1)
Padding injected by the aggregator ASICs; 2) Left alignment entries;
3) bus fragmentation caused by the multicast connections; 4) Token
latency between the separators and the memory controllers; 5)
Separator output bus padding; and 6) Unstriper output bus
fragmentation. A 480G switch is used as an example to analyze the
worst-case performance since it has the most padding, overhead, and
congested traffic.

The aggregator ASICs have to pad a packet (including the 36-bit
route word, variable length packet length field and datagram)
to multiples of 12 since there are 12 memory controllers in one
fabric. The shortest packet each memory controller receives is 7
bits long since a packet can be as short as 84 bits long. The
effective datagram is 3 bits. One entry will be left aligned for
every 16 200-bit memory entries. The left-aligned entry can be as
short as 1 bit long. The worst-case datagram dequeue efficiency per
output port of a memory controller is:

(10 bit (dout_mc bus width) * (3/7) (datagram length in a shortest
packet) * (15/16) (left-aligned overhead)) * 250MHz (output bus
speed) * 12 (number of memory controllers) / 24 (number of output
ports per separator) = 502Mbps

The best-case output data bus bandwidth per separator
channel is 2-bit * 250MHz, i.e., 500Mbps. In other words, the
worst-case dequeue bandwidth of a memory controller is bigger than
the best-case output bandwidth of a separator port. A 2x speedup can
be achieved through the twice-wide output bus of the separators.
One sync cycle will be fired on the output bus of the separator
every 120 cycles.
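The comparison between worst-case dequeue bandwidth and best-case separator output bandwidth can be reproduced numerically. The factors below are a reading of the expression above (10-bit bus, 3/7 datagram fraction, 15/16 left-alignment factor, 250MHz, 12 memory controllers, 24 output ports per separator) and should be treated as assumptions, since the source figures are partly illegible.

```python
BUS_WIDTH_BITS = 10            # assumed dout bus width
DATAGRAM_FRACTION = 3 / 7      # 3 datagram bits in a 7-bit shortest packet
LEFT_ALIGN_FACTOR = 15 / 16    # one left-aligned entry per 16 entries
BUS_CLOCK_HZ = 250e6           # output bus speed
MEMORY_CONTROLLERS = 12
PORTS_PER_SEPARATOR = 24       # assumed divisor

# Worst-case datagram dequeue bandwidth per output port, in bits/s.
worst_case_dequeue_bps = (
    BUS_WIDTH_BITS * DATAGRAM_FRACTION * LEFT_ALIGN_FACTOR
    * BUS_CLOCK_HZ * MEMORY_CONTROLLERS / PORTS_PER_SEPARATOR
)

# Best-case output bandwidth per separator channel: 2 bits at 250MHz.
best_case_separator_bps = 2 * BUS_CLOCK_HZ

dequeue_keeps_up = worst_case_dequeue_bps > best_case_separator_bps
```

With these factors the worst-case figure rounds to 502Mbps, slightly above the 500Mbps separator channel rate.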

The output bus of the unstriper ASIC is 64-bit wide at
100MHz. It can only carry one packet per cycle. In the worst case,
up to 50 bits are wasted per packet for an OC48 port.

APS stands for Automatic Protection Switching, which is
a SONET redundancy standard. To support the APS feature in the switch,
two output ports on two different port cards send roughly the same
traffic. The memory controllers maintain one set of queues for an
APS port and send duplicate data to both output ports.

To support data duplication in the memory controller
ASIC, each of the multiple unicast queues has a
programmable APS bit. If the APS bit is set to one, a packet is
dequeued to both output ports. If the APS bit is set to zero for
a port, the unicast queue operates in the normal mode. If a port
is configured as an APS slave, then it will read from the queues of
the APS master port. For OC48 ports, the APS port is always on the
same OC48 port on the adjacent port card.



Port mirroring is similar to the APS except that any port
can pair with any port. Only one pair of port mirroring ports is
supported. A 16-bit port mirror register is used to identify the
master and slave port involved in the port mirror operation. A
port can either have APS enabled or port mirroring enabled, not
both. The value of the port mirror register can be changed on the
fly by shadow registers.





The shared memory queues in the memory controllers among
the fabrics might be out of sync (i.e., same queues among different
memory controller ASICs have different depths) due to clock drifts
or a newly inserted fabric. It is important to bring the fabric
queues to the valid and sync states from any arbitrary states. It
is also desirable not to drop cells for any recovery mechanism.

A resync cell is broadcast to all fabrics (new and
existing) to enter the resync state. Fabrics will attempt to drain
all of the traffic received before the resynch cell before queue
resynch ends, but no traffic received after the resynch cell is
drained until queue resynch ends. A queue resynch ends when one of
two events happens:

1. A timer expires.

2. The amount of new traffic (traffic received after the resynch
cell) exceeds a threshold.

At the end of queue resynch, all memory controllers will
flush any left-over old traffic (traffic received before the queue
resynch cell). The freeing operation is fast enough to guarantee
that all memory controllers can fill all of memory no matter when
the resynch state was entered.
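The queue resynch behavior described above can be modeled with a small sketch. It is a simplified software model only, not the ASIC logic. The class name, the one-item-per-cycle drain rate, and the way the timer and threshold are expressed are all assumptions made for illustration.

```python
from collections import deque


class QueueResync:
    """Toy model of one queue during a queue resynch."""

    def __init__(self, timeout_cycles, new_traffic_threshold):
        self.timeout_cycles = timeout_cycles
        self.new_traffic_threshold = new_traffic_threshold
        self.elapsed = 0
        self.old = deque()  # traffic received before the resynch cell
        self.new = deque()  # traffic received after the resynch cell

    def start(self, queued):
        """Enter the resynch state with the traffic already queued."""
        self.old = deque(queued)
        self.new.clear()
        self.elapsed = 0

    def tick(self, arrivals=()):
        """One cycle: hold new arrivals, drain one old item if any."""
        self.elapsed += 1
        self.new.extend(arrivals)   # new traffic is not drained yet
        if self.old:
            self.old.popleft()

    def ended(self):
        """Resynch ends on timer expiry or too much new traffic."""
        return (self.elapsed >= self.timeout_cycles
                or len(self.new) > self.new_traffic_threshold)

    def finish(self):
        """Flush left-over old traffic; return how many items were flushed."""
        flushed = len(self.old)
        self.old.clear()
        return flushed
```

After `finish()`, only traffic received after the resynch cell remains queued, which is the state all fabrics are meant to share.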

Queue resynch impacts all 3 fabric ASICs. The
aggregators must ensure that the FIFOs drain identically after a
queue resynch cell. The memory controllers implement the queueing
and dropping. The separators need to handle memory controllers
dropping traffic and resetting the length parsing state machines
when this happens. For details on support of queue resynch in
individual ASICs, refer to the chip ADSs.

Multicast connections are enqueued into one of 4 priority
queues based on the 2-bit priority number. They are stored cache
line based, the same way unicast connections are. Connection numbers
and lengths are stored into one of 4 1K-entry per-priority
connection FIFOs. Multicast packets are subject to be dropped if the
destined connection FIFO is full. In other words, at most 1K
multicast packets can be stored simultaneously for each priority.

The 64Kx16-bit port mask memory will limit the number of
multicast connections supported to 64K, 32K, 16K, 16K, 8K, and 4K
for the 40G, 80G, 120G, 160G, 240G, and 480G switch, respectively.

For the dequeue side, multicast connections have
32 independent tokens per port, each worth up to 50 bits of data or a
complete packet. The head connection and its port mask of a higher
priority queue is read out from the connection FIFO and the port
mask memory every cycle (125MHz). A complete packet (or 50 bits if
the packet is longer than 50 bits) is isolated from the 200-bit
multicast cache line based on the length field of the head
connection. The head packet is sent to all its destination ports.
The 8 queue drainers transmit the packet to the separators when
there are non-zero multicast tokens available for the ports.
The next head connection will be processed only when the current head
packet is sent out to all its ports.

For the worst case analysis, use the 480G switch as an
example where the shortest packet is 7 bits long. Every 8ns cycle
only one connection can be handled (bottlenecked by the connection
FIFO and port mask memory). If the multicast only goes to 1 port,
the effective dequeue throughput for the multicast connection is
875Mbps out of the available 15Gbps shared memory dequeue bandwidth,
i.e., 6%. In other words, the multicast performance is severely
damaged by the bottlenecks existing in the connection FIFO, port
mask memory, and head-of-line blocking. The throughput for the 480G
switch is 480*7*n/80 = n*42G where n is the number of copies a
multicast connection is destined for. In the worst case where n=1,
the multicast throughput is about 9% of the available switch
capacity. If the average multicast connection makes 11 copies, the
switch can achieve close to 480G throughput.
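The worst-case multicast figures can be checked with a few lines of arithmetic. The sketch reads the garbled source figures as a 7-bit shortest packet, an 8ns cycle, 15Gbps of dequeue bandwidth, and a 480G switch whose multicast throughput is 480*7*n/80; these readings are assumptions, and the names are illustrative.

```python
SHORTEST_PACKET_BITS = 7
CYCLE_NS = 8                  # one connection handled per 125MHz cycle
DEQUEUE_BANDWIDTH_MBPS = 15000
SWITCH_CAPACITY_G = 480

# One 7-bit packet every 8ns, expressed in Mbps.
single_copy_mbps = SHORTEST_PACKET_BITS * 1000 / CYCLE_NS

# Fraction of the shared memory dequeue bandwidth actually used.
dequeue_fraction = single_copy_mbps / DEQUEUE_BANDWIDTH_MBPS


def multicast_throughput_g(copies):
    """Switch-level multicast throughput in Gbps for n copies."""
    return SWITCH_CAPACITY_G * SHORTEST_PACKET_BITS * copies / 80
```

One copy yields 42G (under 9% of capacity), and eleven copies yield 462G, which is why a fan-out of about 11 is needed to approach full throughput.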

The longer a packet is (for the 240G switch or smaller
configurations) and the more ports a multicast connection is
destined for, the better the dequeue performance becomes. Multicast
performance does not interfere with the dequeue speedup for unicast
connections since the latter have their own tokens and the two types of
connections share the dout_mc bus alternately in a strict round-
robin fashion, i.e., the multicast connections do not block unicast
ones.

There are 192 unicast queues, 4 multicast queues, and 4
control port queues. The 4 multicast queues are per priority based and
can broadcast to any subset of 192 output ports and the 4 control
ports.

There are up to 196 destination channels (192 blade
channels and 4 control ports) for the 480G switch. Each destination
has a one-to-one mapped unicast queue. 4 multicast queues can
broadcast to any subsets of 192 regular ports indicated by the
per-connection based port mask entry. An OC192 port uses one out of
4 queue locations. The other three queues are unused. All 8 bits of
the fabric queue ID field on the DestID bus are used to identify one
of 196 ports. The 2-bit priority field is unused.



For the 240G switch, up to 100 destination channels exist
(96 blade channels and 4 control ports). 96 unicast destination
queues have 2 priority queues each. 4 multicast queues can
broadcast to any subsets of 96 ports indicated by the per-connection
based port mask entry. An OC192 port uses one out of 4
queue locations. The other three queues are unused. The lower 7-bit queue
ID is used to identify one of 100 ports and the lower 1 bit of the priority
field is used to identify one of two priority queues in each port.
The other queue ID bit and priority bit are unused.



For the 160G switch, up to 68 destination channels exist
(64 blade channels and 4 control ports). 64 unicast destination
queues have 2 priority queues each. There are 60 unused queues. 4
multicast queues can broadcast to any subsets of 68 ports indicated
by the per-connection based port mask entry. An OC192 port uses
one out of 4 queue locations. The other three queues are unused.
The lower 7-bit queue ID is used to identify one of 100 ports and the lower
1 bit of the priority field is used to identify one of two priority
queues in each port. The other queue ID bit and priority bit are unused.



For the 120G or smaller switch, up to 52 destination
channels exist (48 blade channels and 4 control ports). 48 unicast
destination queues have 4 priority queues each. 4 multicast queues
can broadcast to any subsets of 48 ports indicated by the
per-connection based port mask entry. An OC192 port uses one out of
4 queue locations. The other three queues are unused. The lower 6-bit
queue ID is used to identify one of 52 ports and the 2-bit priority
field is used to identify one of 4 priority queues in each port.
The other queue ID bits are unused.

Queue structure can be changed on the fly through the fabric
resync cell where the number of priority per port field is used to
indicate how many priority queues each port has.
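The way the queue ID and priority fields are interpreted per switch size can be summarized in a small lookup. The sketch assumes an 8-bit queue ID and a 2-bit priority field on the DestID bus and uses the field widths stated above for each configuration. The table and function names are hypothetical.

```python
# (used queue-ID bits, used priority bits) per switch configuration.
FIELD_WIDTHS = {
    "480G": (8, 0),  # all queue ID bits, priority unused
    "240G": (7, 1),  # lower 7 ID bits, lower 1 priority bit
    "160G": (7, 1),
    "120G": (6, 2),  # lower 6 ID bits, both priority bits
}


def decode_queue(config, queue_id, priority):
    """Return (port, priority queue) using only the bits that are in use."""
    id_bits, prio_bits = FIELD_WIDTHS[config]
    port = queue_id & ((1 << id_bits) - 1)
    prio = priority & ((1 << prio_bits) - 1)
    return port, prio


def priority_queues_per_port(config):
    """Number of priority queues each port has in this configuration."""
    return 1 << FIELD_WIDTHS[config][1]
```

For example, in the 240G configuration the top queue ID bit and the top priority bit are ignored, leaving two priority queues per port.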

The striper ASIC resides on the network blade. It has the
following features:

- Supports packet/cell interfaces. Can accept up to 3 GB/sec of
sustained traffic (3.2 GB/sec in bursts) of cells, frames, or
a mix of cell and frame traffic.

- Generates fabric routeword for all fabrics in the switch

- Calculates data for the parity fabric and adds checksum to the
end of each packet.

- Supports switch configurations: 40G, 80G, 120G, 160G, 240G, and 480G

- Generates appropriate signals to interface directly to the
transmit side of the Gbit transceivers.

The Striper takes DID cell/packet format from the ingress
port ASIC. For the ATM interface, the ASX cell format is accepted
from the Vortex ASIC of the Poseidon chipset at 2.5Gbps for the
channelized blade. It consists of a 4-byte route word, a 4-byte ATM
cell header (without HEC byte), and a 48-byte payload. The 36-bit
switch route word can be generated based on the ASX route word
provided by the Vortex ASIC.

The Striper ASIC consists of three major blocks: the
switch route word generator, the switch payload & checksum
generator, and the switch parity generator.

The switch payload generator forwards the 4-byte ATM cell
header, 48-byte ATM cell payload and 2-byte checksum to up to 12
switch fabrics and 1 spare fabric. The cell bus is 2x 12 bit wide
running at 125MHz.

The Striper ASIC duplicates the packet/cell and transmits
various fragments to the fabrics. The 12 data output buses of the
striper ASICs are connected to the data input buses of the
aggregator ASICs on the fabrics as follows:

Figure 14 shows the striper ASIC architecture.

TABLE 20: Data bus connectivity of the Striper ASIC of blade #1



15 

















(DOUT_ST_l_ 


40G (1 fabric)


80G (2 fabrics)


120G (3 fabrics)


160G (4 fabrics)


240G (6 fabrics)


480G (12 fabrics)








DIN AG 1 1 cli 1 


DIN AG 1 1 eh I 


DIN AG 1 1 eh 1 


DIN AG 1 1 eh 1 


DIH,AG,lJ_ch_l 


DIN AG 1 1 eh 1 










■eellflhOl 


5.0] ee11[lliC] 


[3:0] ■ eell[ll:6] 


[2.0] cell[11.9] 


[1.0] cdlfll.10] 


;0]-eell[ll] 








DIN AG 2 1 tl) 1 


DIN AG 1! 1 eh 1 


DIN AG 2 1 eh 1 


DIN AG 2 1 ah 1 




DIN AG 2 1 eh 1 












5:01-ec)lf5;01 


3.0] aellf7.41 


2.01 eellfOiO] 


1:01 eell[9:01 


01 tdiriOl 








DIN AG 3 1 ell 1 


DIN AG 3 1 eh i 


DIN AG 3 1 eh 1 










DIN AG 3 1 eh 1 


■ccllf3:01 


2.01 CLllf5.31 


1.01 lcIIP-OI 


01'celim 






tht 










DIN_AG_4_l_ch_l 


DIN AG < 1 eh 1 


DIN AG_4_l_eh_l 


2.01 eelir2:01 


1-01 eelirs.41 


01-eeHfQ] 






tht 


tht 










DIN AG 5 1 eh 1 

■cciipai ~ 


DIN AG 5 1 eh 1 
01 tdl[71 






tht 


th 










DIN_AG_0_l_eh_l 
-eellf 1:01 


DIN_AG_6_l_eh_l 
01 eell[Gl" 






tht 








DlN_AG_7_l_th_l 






tht 








DIN_AG_0_l„chJ 






tht 








DlN_AG_9_l_eh_l 






tht 








DIN AG 10 1 eh 
l-eellf21 






tht 








DIN AG,ll_l_e1i. 
1 eellf 11 






tht 








DIN_AG_12_l_eh_ 




DIN_AG_jpJ_ch_ 1 


DIN AG jp 1 eh, 
1 f5.01 pnrityffiiOl 


DIN_AG_ap_l_eh_ 
ir3.of paiitj[3.01 


DIN AG jp 1 tli 
U2:01 parirrf2:01 


DIN_AG_jp_l_eh_ 
If 1.0] parityf 1.01 


DIN AG_jp_l_eh_
If 01 parity f01

The striper ASICs on blade #1 are connected with
aggregator ASIC #1 of all switch fabrics. The striper ASICs on
blade #2 are connected with aggregator ASIC #2 of all switch
fabrics. The striper ASICs on blade #4 are connected with aggregator
ASIC #4 of all switch fabrics. The striper ASICs on blade #5 to #8
are connected with aggregator ASIC #5 to #8 of all switch fabrics,
respectively. The striper ASICs on blade #41 to #48 are connected
with aggregator ASIC #1 to #8 of all switch fabrics, respectively.
In other words, the blade number modulo 8 is the aggregator ASIC
number which a striper ASIC is connected to.
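The blade-to-aggregator rule can be expressed as a one-line mapping. The sketch assumes 48 blades and 8 aggregator ASICs per fabric, both numbered from 1, so that a remainder of zero maps to aggregator #8. The function names are illustrative.

```python
AGGREGATORS_PER_FABRIC = 8
BLADES = 48


def aggregator_for_blade(blade):
    """Aggregator ASIC number (1..8) serving the given blade (1..48)."""
    return (blade - 1) % AGGREGATORS_PER_FABRIC + 1


def blades_for_pair(first, second):
    """Blades served by a pair of aggregator ASICs, in blade order."""
    return [b for b in range(1, BLADES + 1)
            if aggregator_for_blade(b) in (first, second)]
```

Under this mapping the aggregator #1 and #5 pair serves blades 1, 5, 9, and so on up to 45, i.e. twelve blades per pair.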

The parity bits are sent to the spare fabric. The purpose
of the spare fabric is to provide fault tolerance ability to the
switch, i.e., in case one of the switch fabrics fails, the spare
fabric recovers the lost part of the cell. This is achieved through
a parity bit generator on the striper ASIC. For the one-fabric
configuration, the 12-bit cell payload is duplicated to the spare
fabric; for the 2-fabric configuration, 6 parity bits are generated
as follows:

parity bit(1:6) = cell bit(1:6) exclusive-OR cell bit(7:12);

For the 3-fabric configuration, 4 parity bits are generated as
follows:

parity bit(1:4) = cell bit(1:4) exclusive-OR cell bit(5:8)
exclusive-OR cell bit(9:12);

The route word generator regenerates the switch route
word and sends up to 12+1 1-bit 250MHz route word buses for fabric
1, 2, 3, ..., 12 and the spare fabric.
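The parity rules for the spare fabric amount to XOR-ing the equal-width slices of the 12-bit cell word. A minimal sketch follows. It generalizes the stated 1-, 2- and 3-fabric rules to any fabric count that divides 12, which is an assumption, as is the mapping of "cell bit(1:6)" to the low-order bits. The function names are hypothetical.

```python
CELL_BITS = 12


def cell_slices(cell, fabrics):
    """Split a 12-bit cell word into one equal-width slice per fabric."""
    width = CELL_BITS // fabrics
    mask = (1 << width) - 1
    return [(cell >> (i * width)) & mask for i in range(fabrics)]


def parity_bits(cell, fabrics):
    """Parity word sent to the spare fabric: XOR of all slices.

    With one fabric this is simply a duplicate of the cell word.
    """
    parity = 0
    for s in cell_slices(cell, fabrics):
        parity ^= s
    return parity


def recover_slice(slices, lost_index, parity):
    """Rebuild the slice of a failed fabric from the others and parity."""
    value = parity
    for i, s in enumerate(slices):
        if i != lost_index:
            value ^= s
    return value
```

Because XOR is its own inverse, any single lost slice can be rebuilt from the surviving slices and the parity word.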

The aggregator ASIC resides on the switch fabric as shown
in the following figure. Each 40G switch fabric has 8 aggregator
ASICs. It aggregates 6x4 separate cell streams and route words into
a single 12G stream from up to 6 blades and 4 channels. All input
signals from the network blades are 250MHz point-to-point HSTL. It
outputs a single cell stream that is multiplexed with cell payload
and route words to 12 memory controllers. The ASIC has the following
features:

- 12Gbps data and route word input from up to 6 network blades
and 4 channels

- Route word separation and aggregation

- Output 12G data and route word to 12 memory controller ASICs

- HSTL interface with the memory controller, receiver interface
for the backplane gigabit transceivers.

Figure 15 shows the aggregator ASIC architecture.

The aggregator ASIC supports 40G, 80G, 120G, 160G, 240G,
and 480G switch configurations without backplane change. The
backplane connectivity (DIN_AG buses) of a pair of aggregator ASICs
is shown as follows:

TABLE 21: DIN_AG bus connectivity of aggregator ASIC #1 and #5 of switch fabric #1



PINAGJ_1_ehJ»u 
DIN AG 1 S th bu 


40G (1 fabric)


80G (2 fabrics)


120G (3 fabrics)


160G (4 fabrics)


240G (6 fabrics)


480G (12 fabrics)




ukju i _u i _ i 

ll 1 Lftllf 11.01 


5.0] ttllfll.fl] ~ 




l~r V \J i. U» 1 V- 1 1 1 [ 


LS \S \J l tJ» I U I J 1 [ 


\J \J \J 1 _ U 1 I ^ I U 1 | 




3.0] eell[11.0] 


10] cell[11.9] 


1.0] e.ell[l 1.10] 


0] eell[U] 




rrh 


DOUT GT fi.eh 1[ 
5.01 ctllf 11.61 


DOUT GT 5 ciijf 
3:01 eellflltOI 


DOUT GT 5 eh 1[ 
2:01 celir 11:91 


DOUT GT 5 eh 1[ 
1.01 eellH 1.101 


01 eeiirm 




rth 














DOUT GT 0 eh 1[ 


DOUT GT 9 eh l[ 


DOUT_GT_9 eh 1[ 


DOUT GT 9 tli H 


3:O]-ccll[ll:0] 


10]-ecll[ll!9] 


l;0]-cc11[ll:10] 


0]-eell[ll] 
















D1H.AG J_fi,c1ijj[2: 0] 


DOUT_GT_13_eh_ 


POUT_GT_13_eh, 


DOUT,0T_1 3_c1i_ 
If 0] eelTflll 


1T2.01 cellf 11.91 


lfl.01 eeliri 1.101 


















DlN_AG_l_l_eh_3 


DOUT_GT_l 7_eh_ 
1 T 1 -01 eeliril.lOl" 


DOUT^OTJ 7_eh_ 
If 0] eelUMI 
















DlN_AGJ_5_ch_3 


DOUT_GT_2 
1 T 1 -01 tdlfll.101** 


DOUT GT 21 eh 
UOI cellflll 
















DlN_AG_1J_eh_4 


DOUT_ST^25_ch_ 
lfO] celirill' 














DOUT GT 29 eh 


DIN_AGJ ,5_eli_4 


ifoi--cdinn 














DOUT GT 33 eh 


DlN_AGJJ>hJi 


If 01 ■eeiirm 
















DlN_AGJ_5_eh_5 


DOUT^OT 37, eh 
If 01 eellflll 






tftt 










DIH_AGJJ_diJi 


DOUT GT 4 1 eh 
lfOI-eelirill 






rtfo 








DOUT ST 45 eh 


DlN_AGJ_fi_chG 


'f°1 



The 2 x 6 DIN_AG buses of the aggregator ASIC #1 and #5 pair
of switch fabric #1 are connected to the 12 x DOUT_ST bus #1 of
blade #1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, and 45,
respectively. The 2 x 6 DIN_AG buses of the aggregator ASIC #2 and #6
pair of switch fabric #1 are connected to the 12 x DOUT_ST bus #1 of
blade #2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, and 46,
respectively. The 2 x 6 DIN_AG buses of the aggregator ASIC #3 and #7
pair of switch fabric #1 are connected to the 12 x DOUT_ST bus #1 of
blade #3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, and 47,
respectively. The 2 x 6 DIN_AG buses of the aggregator ASIC #4 and #8
pair of switch fabric #1 are connected to the 12 x DOUT_ST bus #1 of
blade #4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, and 48,
respectively.



Likewise, the 2 x 6 DIN_AG buses of the aggregator ASIC #1 and #5
pair of switch fabric #2 are connected to the 12 x DOUT_ST bus #2 of
blade #1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, and 45,
respectively. The 2 x 6 DIN_AG buses of the aggregator ASIC #1 and
#5 pair of switch fabric #12 are connected to the 12 x DOUT_ST bus
#12 of blade #1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, and 45,
respectively, for the 480G switch configuration.

The above connectivity is repeated 4 times for the channelized
blades.

For the 40G, 80G, 120G, 160G, 240G, and 480G configurations, each
blade channel sends 12 x 36-bit cell payload and 36-bit route word,
6 x 36-bit payload and 36-bit route word, 4 x 36-bit payload and
36-bit route word, 3 x 36-bit payload and 36-bit route word, 2 x
36-bit payload and 36-bit route word, and 1 x 36-bit payload and
36-bit route word to each switch fabric, respectively. In other
words, the whole 12-slice wide cell is transmitted in the same
fabric for the 40G switch, while only a 1-slice wide (1/12 cell)
cell slice is transmitted on each fabric for the 480G switch.
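The relationship between switch capacity and the per-fabric slice count can be sketched as follows. This is a minimal illustration, assuming that each fabric contributes 40G of capacity (so the configurations correspond to 1, 2, 3, 4, 6, and 12 fabrics); the function name `slices_per_fabric` is hypothetical and not part of the described hardware.

```python
# Assumption: each fabric contributes 40G, so a configuration of capacity C
# uses C / 40 fabrics. The names below are illustrative only.
CELL_WIDTH_SLICES = 12   # a cell is 12 slices wide
SLICE_BITS = 36          # each payload slice and each route word is 36 bits


def slices_per_fabric(capacity_g: int) -> int:
    """Number of 36-bit payload slices a blade channel sends to one fabric."""
    fabrics, remainder = divmod(capacity_g, 40)
    if remainder or fabrics == 0 or CELL_WIDTH_SLICES % fabrics:
        raise ValueError(f"unsupported configuration: {capacity_g}G")
    return CELL_WIDTH_SLICES // fabrics


for capacity in (40, 80, 120, 160, 240, 480):
    n = slices_per_fabric(capacity)
    print(f"{capacity}G: {n} x {SLICE_BITS}-bit payload "
          f"and {SLICE_BITS}-bit route word per fabric")
```

The same division also gives the cell payload length per fabric, 36 x 12 / number of fabrics.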

The 60-bit DOUT_AG bus is split onto 12 memory controller ASICs,
each receiving 5-bit data and a 1-bit clock signal from one
aggregator ASIC. The 15-bit DestID bus is broadcast to all 12
memory controllers. Due to the fan-out load concern, 3 copies of
the signals are maintained, each driving 4 ASIC loads.



Every channel of the aggregator sends up to a 12x3x200-bit
cell/packet stream to the 12 memory controllers based on a
work-conserving round-robin dequeue algorithm, i.e., the next source
takes over if the current source runs out of eligible cells/packets
to send. A strict round-robin algorithm is used among 24 sources.
For the 40G switch, only 4 source channels exist. A source is
eligible to send a cell/packet whenever a full cell, a full short
packet, or a 12x3x200-bit segment of a long packet is received.
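A work-conserving round robin of the kind described above can be sketched in a few lines. This is only an illustrative software model, not the ASIC logic: the per-source queues are assumed to hold only eligible items, and the function name `round_robin_dequeue` is hypothetical.

```python
from collections import deque


def round_robin_dequeue(sources, start=0):
    """Yield (source_index, item) pairs in round-robin order.

    A source with nothing eligible is skipped at once, so the next source
    takes over without an idle slot (work conserving).
    """
    n = len(sources)
    idx = start
    remaining = sum(len(q) for q in sources)
    while remaining:
        queue = sources[idx]
        if queue:
            yield idx, queue.popleft()
            remaining -= 1
        idx = (idx + 1) % n


# Four source channels, as in the 40G switch; source 1 has nothing to send.
sources = [deque(["a0", "a1"]), deque(), deque(["c0"]), deque(["d0", "d1", "d2"])]
order = list(round_robin_dequeue(sources))
print(order)
```

Each non-empty source sends one item per turn, and empty sources do not delay the others.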



Each memory controller ASIC receives 9 independent cell streams
from 9 aggregator ASICs. There are 9 250MHz DIN_ME_fb_se buses, each
consisting of a 5-bit data bus, a 1-bit clock signal, and a 15-bit
DestID bus. The 60-bit DOUT_AG data buses of all 9 aggregator ASICs
are bit-sliced onto 12 memory controllers, each receiving 5-bit data
from one DOUT_AG bus. Every memory controller gets a separate
non-sharing clock signal (named clk1 to clk12) from each DOUT_AG bus
to reduce the load of the clock pin, while 3 memory controllers
share a set of DestID bus from the DOUT_AG bus. The 9 DIN_ME_fb_se
buses of memory controller #1 are connected to the DOUT_AG buses of
9 aggregators as follows:

DIN_ME_fb_l_l_data DOUT_AG_f b_l_data [ 4 0 , 3G, 24 , 12 , 0 ] 

DIN_ME_fb_l_l_dest DQUT_AG_f b_l_de s 1 1 

2 0 DIN_ME_fb_l_l_clk DOUT_AG_fb_l_clkl 

DIN_ME_fb_l_2_data - D0UT_AG_f b_2_data [ 4 0 , 3 G, 24 , 12 , 0 ] 

DIN_ME_fb_l_2_dest -» D0UT_AG_f b_2__dest 1 

DIN_ME_fb_l_2_clk — DOUT_AG_fb_2_clkl 

DIN_ME_fb_l_3_data - D0UT_AG_f b_3_data [ 40 , 3G, 24 , 12 , 0 ] 

25 DIN_ME_fb_l_3_dest - D0UT_AG_f b_3_dest 1 

DIN ME fb 1 3 elk - DOUT AG fb 3 clkl 
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15 



DIN_ME_fb_l_4_data ^ DOUT_AG_f b_4_data [4 0, 3G, 24, 12, 0] 
DIN_ME_fb_l_4_dest - DOUT_AG_fb_4_destl 
DIN ME fb 1 4 elk - DOUT AG fb 4 clkl 



DIN_ME_fb_l_5_data -■ DOUT_AG_fb_5_data [4 0, 3G, 24, 12, 0] 
DIN_ME_fb_l_5_dest -■ DOUT_AG_f b_5_de a 1 1 

DIN_ME_fb_l_5_clk DOUT_AG_f b_5_cl kl 

DIN_ME_fb_l_C_data - DOUT_AG_fb_G_data [4 0, 3G, 24, 12, 0] 
DIN_ME_fb_l_G_d e st ^ DOUT_AG_f b_G_dest 1 

DIN_ME_fb_l_G_clk DOUT_AG_fb_G_elkl 

DIN_ME_fb_l_7_data DOUT_AG_f b_7_da t a [40, 3G, 24, 12,0] 

DIN_ME_fb_l_7_dest " DOUT_AG_fb_7_d e stl 
DIN ME fb 1 7 elk ^ DOUT AG fb 7 clkl 



DIN_ME_fb_l_0_data ^ DOUT_AG_fb_0_data [40, 3G, 24, 12, 0] 

DIN_ME_fb_l_0_dest DOUT_AG_f b_0_deat 1 

DIN_ME_fb_l_0_elk DOUT_AG_fb_0_clkl 
DIN_ME_fb_l_0_data " DOUT_AG_fb_0_data [40, 3G, 24, 12, 0] 
DIN_ME_fb_l_0_dest ^ DOUT_AG_fb_0_destl 
DIN ME fb 1 D elk ^ DOUT AG fb 0 elkl 



The DIN_ME data buses of memory controller #2 are connected to bit
49, 37, 25, 13, and 1 of the DOUT_AG data buses of the 9
aggregators, and so on. The DIN_ME data buses of memory controller
#12 are connected to bit 59, 47, 35, 23, and 11 of the DOUT_AG data
buses of the 9 aggregators.
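The bit-slicing pattern can be stated compactly: memory controller k (1 to 12) takes every twelfth bit of the 60-bit DOUT_AG data bus, starting at bit k-1. The sketch below is an illustrative model of that wiring; the function names `dout_ag_bits` and `slice_word` are hypothetical, and the most-significant-first ordering of the 5-bit slice is an assumption.

```python
NUM_CONTROLLERS = 12
BITS_PER_CONTROLLER = 5   # 12 controllers x 5 bits = 60-bit DOUT_AG data bus


def dout_ag_bits(controller: int) -> list[int]:
    """DOUT_AG data bit indices wired to a memory controller (1-based)."""
    if not 1 <= controller <= NUM_CONTROLLERS:
        raise ValueError("controller must be in 1..12")
    base = controller - 1
    # Highest bit first, matching the order used in the connection list.
    return [base + NUM_CONTROLLERS * j
            for j in reversed(range(BITS_PER_CONTROLLER))]


def slice_word(word: int, controller: int) -> int:
    """Extract the 5-bit slice of a 60-bit word seen by one controller."""
    out = 0
    for bit in dout_ag_bits(controller):
        out = (out << 1) | ((word >> bit) & 1)
    return out


print(dout_ag_bits(1))    # bits for memory controller #1
print(dout_ag_bits(12))   # bits for memory controller #12
```

Taken over all 12 controllers, the slices cover each of the 60 data bits exactly once.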
The 12 memory controller ASICs aggregate cell/packet streams from
8+1 aggregator ASICs and then write the cells into one of 200
output queues (e.g., 12 network blades x 4 channelized Poseidon
interfaces x 4 priorities for unicast + 4 priorities for multicast
+ 4 control port queues). The 8-bit destination queue number on the
DestID bus is used as the output queue indicator for a unicast
connection. A multicast cell is stored into one of 4 priority
queues based on the 2-bit priority on the DestID bus. The 16-bit
multicast connection number on the DestID bus will be used to look
up the internal port mask memory to find out the destination blade
and channels during the dequeue phase.

The memory controllers send out cell/packet traffic from the 200
output queues to 8+1 separator ASICs. Dequeuing speed is twice as
fast as enqueuing speed to reduce the amount of cells buffered on
the switch fabric.

- Support both variable-length packet switching and fixed-length
  cell switching
- 12 ASICs are bit-sliced and function as an integrated shared
  memory controller
- Support 40G, 80G, 120G, 160G, 240G, and 480G switch
  configurations
- Enqueue cells/packets from 9 aggregator ASICs
- 2X dequeue speedup to 9 separator ASICs
- On-chip APS support
- 234,057 cells on-chip buffer
- 200 programmable destination queues
- On-chip control port support
- 64K multicast connections, 2^32 unicast connections
- Per-queue transmit and loss counts

Figure 10 shows memory controller ASIC architecture.

An 8Kx13-bit link list is used to maintain the free/used memory
entry list pointers. A free entry is requested from the free link
list when writing data into the shared memory and the current tail
cache line runs out of space. A complete cell/packet will be
dropped whenever the free list is empty, i.e., the shared memory is
full. A memory entry is freed to the free list after the memory
word is transmitted to the separator ASICs.
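The free-list behaviour can be modelled in software as a singly linked list of free entry indices. The sketch below is a hedged illustration only: the class `FreeList` and the helper `enqueue_packet` are hypothetical names, and the roll-back on exhaustion is one possible reading of the rule that the complete cell/packet is dropped when the free list is empty.

```python
class FreeList:
    """Free-entry list for a shared memory (8K entries need 13-bit pointers)."""

    def __init__(self, entries: int = 8 * 1024):
        # next_ptr[i] links entry i to the next free entry; None ends the list.
        self.next_ptr = [i + 1 for i in range(entries - 1)] + [None]
        self.head = 0 if entries else None

    def allocate(self):
        """Return a free entry index, or None when the shared memory is full."""
        entry = self.head
        if entry is not None:
            self.head = self.next_ptr[entry]
        return entry

    def release(self, entry: int) -> None:
        """Return an entry to the free list after its word has been sent."""
        self.next_ptr[entry] = self.head
        self.head = entry


def enqueue_packet(free_list: FreeList, lines_needed: int):
    """Allocate every cache line for a packet, or drop the whole packet."""
    taken = []
    for _ in range(lines_needed):
        entry = free_list.allocate()
        if entry is None:
            for e in taken:   # assumption: roll back so nothing partial stays
                free_list.release(e)
            return None
        taken.append(entry)
    return taken
```

A return value of `None` corresponds to a dropped packet; a list of indices corresponds to the entries the packet occupies until it is dequeued.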

Figure 17 shows the wide cache line shared memory architecture.

The DIN_ME_fb_se_9 and DOUT_ME_fb_se_9 buses are used to connect to
aggregator #9 and separator #9, which communicate with the control
port striper and unstriper ASICs only. They have the same DestID
and cell format as the other 8 buses do. Their cells are enqueued
and dequeued in the same way as the regular cells.

There are up to 4 additional control port queues. They have queue
IDs from 192 to 195. All unicast connections having a control port
queue ID as their fabric queue ID are enqueued into the relative
control port queue. At most 4 OC12 control ports are supported.
Each control port queue has a 13-bit control port register as
follows:

TABLE 22: 13-bit Control port queue register

Bit 12:5   6-bit regular port ID
Bit 4      Regular Port enable
Bit 3      Control Port 3 enable
Bit 2      Control Port 2 enable
Bit 1      Control Port 1 enable
Bit 0      Control Port 0 enable



A queue can be multicast to up to 4 physical control ports and one
regular queue. When a queue is redirected to the regular queue,
that queue must be disabled for the regular queue traffic. Packets
are queued in the same way as in the regular queues, i.e., 200-bit
cache line based and left-aligned every 36 cache lines. Strict
round robin is applied among the 4 queues when a left-alignment
entry is transmitted. A queue is routed to 4 control ports and one
regular port based on the 5-bit control port enable vector.
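Decoding the 13-bit control port queue register can be illustrated as below. The bit assignment used here (bits 12:5 for the regular port ID, bit 4 for the regular port enable, bits 3:0 for control ports 3 to 0) is an assumption made for the sketch, and `decode_cp_register` is a hypothetical helper rather than a documented interface.

```python
def decode_cp_register(reg: int) -> dict:
    """Split a 13-bit control port queue register into its fields.

    Assumed layout: [12:5] regular port ID, [4] regular port enable,
    [3:0] control port 3..0 enables. Bits [4:0] together form the 5-bit
    enable vector.
    """
    if not 0 <= reg < (1 << 13):
        raise ValueError("register value must fit in 13 bits")
    return {
        "regular_port_id": (reg >> 5) & 0xFF,
        "regular_port_enable": bool(reg & 0x10),
        "control_ports": [port for port in range(4) if reg & (1 << port)],
        "enable_vector": reg & 0x1F,
    }


# Example: regular port 42 enabled, plus control ports 0 and 2.
fields = decode_cp_register((42 << 5) | 0b10101)
print(fields)
```

A queue whose enable vector has several bits set is delivered to each of the selected ports.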

Two dequeue algorithms are applied among the 4 control port queues:

a) One control port only talks to one cp queue: pure round-robin
   dequeue among the 4 non-empty control port queues which have
   non-zero unicast tokens; one token worth of unicast (up to
   200-bit) is sent out to the dout_me bus for a port;

b) One control port talks to multicast cp queues: strict priority
   among the 4 control port queues; queue 192 has the highest
   priority and queue 195 has the lowest; queues are switched when
   the end of the packet is seen.
OAM cells are identified by the Fabric queue ID field. If this
field of a unicast connection has the value 0xFx(h), then it is an
OAM cell. All OAM cells can be mapped into one of the 192 blade or
4 control port queues set by an 8-bit programmable register (called
the OAM cell destination register).

A resync cell (0xFF) or any other special cell with the fabric
queue ID set to 0xFx is routed to any one of the 196 queues based
on the OAM cell destination register too.

Per-destination minimum and maximum thresholds and counts can be
set up to help memory management. 200x2x14-bit thresholds (in units
of 200-bit entries) and 200 x 13-bit running counters (in units of
200-bit entries) are provided. Two additional per-destination
transmit and loss counts (32-bit each, in units of packets) are
also maintained. If the running count of a destination is above the
relative threshold, new packets are rejected and the loss count
increments. Whenever dropping, the whole packet is dropped.
Otherwise, the transmit count increments. For multicast
connections, cells can also be rejected because the multicast route
word FIFO is full. 4 additional FIFO full counts are needed. If a
packet is dropped, the whole packet is cleaned from the memory
(including the segments of a long packet). The thresholds and
current counts are in units of 200-bit cache lines.

The minimum threshold (13-bit value plus 1-bit enable bit) is used
to prevent shared memory starvation, i.e., every queue reserves at
least the number of cache lines indicated by the threshold. The
maximum threshold (13-bit value plus 1-bit enable bit) is used to
prevent any single queue from consuming the whole shared memory.
These two thresholds cannot be changed unless there are no packets
in the queues.
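The maximum-threshold check and the transmit and loss counts can be sketched as a small accounting object. This is a simplified model under stated assumptions: only the maximum threshold is modelled (the minimum-threshold reservation is omitted), "above" is read as strictly greater than, and the class name `QueueAccounting` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueueAccounting:
    """Per-destination running count with a maximum threshold (cache lines)."""
    max_threshold: int
    max_enabled: bool = True
    running: int = 0        # cache lines currently held by this queue
    tx_packets: int = 0     # packets accepted
    lost_packets: int = 0   # packets rejected

    def admit(self, lines: int) -> bool:
        """Accept or reject a whole packet occupying `lines` cache lines."""
        if self.max_enabled and self.running > self.max_threshold:
            self.lost_packets += 1      # the whole packet is dropped
            return False
        self.running += lines
        self.tx_packets += 1
        return True

    def drain(self, lines: int) -> None:
        """Account for cache lines sent on to the separator ASICs."""
        self.running -= lines


q = QueueAccounting(max_threshold=4)
results = [q.admit(3), q.admit(3), q.admit(1)]
print(results, q.running, q.tx_packets, q.lost_packets)
```

Because the comparison is made before the new packet is added, a queue can overshoot its threshold by at most one packet.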

All counters are 32 bits wide. They are reset to zero automatically
after reading. Their values stick to 0xFFFFFFFF if overflowed. It
takes 2^32 x 8ns, about 32 seconds, to overflow a counter in the
worst case.
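A counter that saturates and clears on read can be modelled as follows. The sketch assumes one count event every 8 ns in the worst case; with that assumption 2^32 events take roughly 34 seconds, the same order as the round figure quoted above. The class name `StickyCounter` is hypothetical.

```python
class StickyCounter:
    """32-bit counter that sticks at 0xFFFFFFFF and resets to zero on read."""

    MAX = 0xFFFFFFFF

    def __init__(self) -> None:
        self.value = 0

    def increment(self, n: int = 1) -> None:
        # Saturate instead of wrapping around.
        self.value = min(self.value + n, self.MAX)

    def read(self) -> int:
        # Reading returns the current value and clears the counter.
        value, self.value = self.value, 0
        return value


# Worst-case time to overflow, assuming one event every 8 ns.
overflow_seconds = 2**32 * 8e-9
print(f"{overflow_seconds:.1f} s")
```

Software therefore has to read each counter at least once per overflow interval to avoid losing counts.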

The value of any threshold register can be updated on the fly by a
resync cell or a shadow control cell. The content of the 32-bit
shadow data register is copied to the location pointed to by the
shadow address register.

The memory controller can enqueue a single OC192 data stream from
the aggregator ASIC and dequeue a single OC192 data stream to the
separator ASIC instead of 4xOC48 streams. At the ingress side, the
ASIC receives 4 continuous cells/packets/cache lines from the same
source channel instead of from 4 channels. No special treatment is
needed.

At the egress side, the Queue Drainer reads 4 cache lines from the
shared memory for one destination after a token command is received
for the OC192 port. The RCD can send up to 4 200-bit cache lines to
the separator from the same destination queue. Each OC192 port has
4 priorities for all switch configurations.

The separator ASICs receive cell/packet streams from 12 memory
controllers, separate them, and send them to up to 48 network
blades through the backplanes. The interfaces between the separator
and the backplane are 250MHz point-to-point HSTL signals.

Figure 10 shows the Separator ASIC architecture.

- Receive 12 data streams from 12 memory controllers
- Fabric synchronization
- 24+1 destination (blades and channels) addressing
- Route word separation and aggregation
- 0.25um 3V CMOS technology
- 410 I/O pins
- 140-bit 250MHz input; 240-bit 250MHz output (at most 120 of them
  switch simultaneously); 30-bit control signals

The separator has twice the number of data output pins as the
aggregator ASIC to support 2X speedup. Similar to the striper ASIC,
the ASIC supports 40G, 80G, 120G, 160G, 240G, and 480G switch
configurations without backplane change.

The separator ASIC performs the reverse function of the aggregator
ASIC. The ASIC receives a 120-bit 250MHz cell/packet stream from
one of 8 DOUT_ME_fb_se_bu buses of every memory controller (12 of
them). 10-bit blade and channel selection signals are used to
select one of 24 destinations inside each separator for up to two
cells. For example, the DIN_SP buses of separator ASIC #1 are
connected as follows:






DIN_SP_fb_1_3  - DOUT_ME_fb_3_1
DIN_SP_fb_1_4  - DOUT_ME_fb_4_1
DIN_SP_fb_1_5  - DOUT_ME_fb_5_1
DIN_SP_fb_1_6  - DOUT_ME_fb_6_1
DIN_SP_fb_1_7  - DOUT_ME_fb_7_1
DIN_SP_fb_1_8  - DOUT_ME_fb_8_1
DIN_SP_fb_1_9  - DOUT_ME_fb_9_1
DIN_SP_fb_1_10 - DOUT_ME_fb_10_1
DIN_SP_fb_1_11 - DOUT_ME_fb_11_1
DIN_SP_fb_1_12 - DOUT_ME_fb_12_1



- cn_gr_fb_i cn_MC_fb_i 

When a valid cell/packet (channel ID in the range of 0-23) is
received, the packet type field in the route word is checked first.
If it is a cell, no packet length field follows. The length of the
cell payload is 36x12/number of fabrics. If it is a packet, the
packet length bit that immediately follows is used to indicate how
long the packet length field is: 0 for a 12-bit packet length
(including this bit) and 1 for a 24-bit packet length (including
this bit). The entire packet/cell is routed to the destination
channel indicated by the channel ID. An invalid channel ID (bigger
than 24) is used to indicate that the cell/packet is invalid.
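Parsing the variable-width packet length field can be sketched as below. This is an illustration under explicit assumptions: the field is presented as a string of '0'/'1' characters, most significant bit first, the selector bit comes first, and the remaining 11 or 23 bits hold the length value. The function names are hypothetical.

```python
def cell_payload_bits(num_fabrics: int) -> int:
    """Cell payload length seen per fabric: 36 x 12 / number of fabrics."""
    if num_fabrics <= 0 or (36 * 12) % num_fabrics:
        raise ValueError("unsupported number of fabrics")
    return 36 * 12 // num_fabrics


def parse_packet_length(bits: str) -> tuple[int, int]:
    """Return (length_value, field_width) from the start of a bit string.

    The first bit selects the field width: '0' means 12 bits and '1' means
    24 bits, in both cases including the selector bit itself.
    """
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError("bits must be a non-empty string of 0s and 1s")
    width = 24 if bits[0] == "1" else 12
    if len(bits) < width:
        raise ValueError("not enough bits for the length field")
    return int(bits[1:width], 2), width


short = "0" + format(100, "011b")             # 12-bit field, value 100
long_ = "1" + format(5000, "023b") + "1111"   # 24-bit field plus payload bits
print(parse_packet_length(short), parse_packet_length(long_))
```

Short packets pay for only an 11-bit length value, while long packets can use 23 bits.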

The ASIC then separates the route word and the payload onto the
route word bus and the data bus of one of 6 blades and 4
destination channels/unstriper ASICs based on the channel ID
signals. One 250MHz 24-bit data bus yields 6Gbps data bandwidth for
each channel. Each route word bus is 2 bits wide running at 250MHz.
The connectivity between the separator ASICs and the unstriper
ASICs is symmetric to that between the aggregator ASICs and the
striper ASICs. The only difference is that all data and route word
pins have double width to achieve 2X speedup.

Data received from each destination of each memory controller is
accompanied by a 1-bit valid bit. 24 destination input FIFOs are
used to store the 12 pieces of cells/packets from the 12 memory
controllers for the 24 destination blades and channels in each
separator, respectively. When all 12 cell segments arrive, the
complete cell is sent to the relative output FIFO indicated by the
channel ID.

Like the striper ASIC, a 3-bit sequence number counter is
maintained for the backplane synchronization. It increments every
36 250MHz cycles. When a cell is sent to the unstriper ASICs via
the backplane, the current counter value is attached into the
sequence number field in the 36-bit route word.

The sequence number counter is reset by the global
resynchronization logic.

The unstriper ASIC takes 6Gbps traffic from up to 12+1 switch
fabrics. It then unstripes the cell and sends it to the egress
netmod ASIC at 5Gbps or lower speed.

- Receive 6Gbps route word and data from up to 12+1 fabrics at
  250MHz for OC48, or combine 4 chips to support 26 Gbps routeword
  and data from up to 12+1 fabrics for OC192c
- Error check data transport throughout the switch, detect
  corrupted data and perform data recovery
- Reconstruct cells/packets from the individual switch fabrics
- Send 64-bit 100MHz data to the egress port ASIC for OC48, 256-bit
  for OC192c
- Support both UC and MC connection context for fabric data

Figure 19 shows the unstriper ASIC architecture.

The unstriper ASIC receives cells from up to 12+1 fabrics, each
running at 250MHz. It uses the following steps to reconstruct good
data.

1. All incoming routewords are compared. If any one routeword
disagrees, that data lane is flagged as being in error. If more
than one routeword disagrees, the data is dropped.

2. All valid input lanes are put through reconstruction logic which
will attempt to build N+1 candidate output data streams for an N
fabric switch. Any data lane which is not valid will invalidate any
data lane which uses that data.

3. All valid reconstruction lanes will check the CRC of the
received data, and one passing output is selected.
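Step 1 of the procedure above can be illustrated with a small routeword comparison. How "disagrees" is decided is not spelled out here, so the sketch assumes a simple majority vote across lanes; the function name `check_routewords` and the status strings are hypothetical.

```python
from collections import Counter


def check_routewords(routewords):
    """Compare the routewords received from all fabrics.

    Returns (status, bad_lanes) where status is 'ok' when all lanes agree,
    'lane_error' when exactly one lane disagrees with the majority, and
    'drop' when more than one lane disagrees.
    """
    if not routewords:
        raise ValueError("at least one routeword is required")
    # Assumption: the most common value is taken as the reference routeword.
    majority, _ = Counter(routewords).most_common(1)[0]
    bad_lanes = [i for i, rw in enumerate(routewords) if rw != majority]
    if not bad_lanes:
        return "ok", []
    if len(bad_lanes) == 1:
        return "lane_error", bad_lanes
    return "drop", bad_lanes


print(check_routewords([7, 7, 7, 7]))
print(check_routewords([7, 7, 3, 7]))
print(check_routewords([7, 1, 3, 7, 7]))
```

A lane flagged in this step would then be excluded from the reconstruction in step 2.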

The striper remaps the separate routeword and data buses to a
combined outgoing routeword + data bus.
The following will detail the steps which happen at power up from
an architectural perspective. Note that when expanding switch
capacity, the additional fabrics must be brought on-line before any
new port cards are brought on-line.

Fabric Initialization

1. Port cards (unstripers) are initialized to only look at the
current fabric capacity and ignore other fabric inputs.

2. Fabric is inserted and asserts its board present signal.
Stripers start sending routewords to the new fabrics, though they
are ignored at this point.

3. Board is reset, and the MCP starts to boot the board. Before
proceeding to the next step, the MCP/SCP establish communication
via the enet network.

4. If the board is fabric 0 or the parity fabric, the sync pulse
transmitter is initialized. (Actually the sync pulse transmitter
can be initialized on all fabrics, but it is only connected to BP
signals if it is fabric 0 or the parity fabric.)

5. MCP initializes the sync registers in the aggregator, memory
controller, and separator, then initializes the registers in the
sync pulse receiver. The sync pulse receiver starts to look for a
valid sync pulse. The last sync setup is the sync pulse receiver,
so that all receivers on the chips are ready for the sync pulse
from the sync pulse receiver. The fabric chips run chip-chip sync
on the next backplane sync pulse. The MCP should check to make sure
the fabric has synchronized. If sync has not been achieved, reset
the fabric chips and re-execute step 4.
6. SCP tells MCP the current switch capacity window to use. This is
actually going to correspond to the current switch capacity (it
does not count the capacity of the new fabric if switch capacity is
being expanded).

7. MCP initializes the backplane transceiver networks with the
current switch capacity (both send and receive) and initializes all
registers except the aggregator input enables. Any values used for
configurable options (which ports are OC48/OC192, memory
thresholds, etc.) need to be communicated and initialized at this
point. Certain registers are initialized based on the switch board
slot, which needs to be known at this point. From a software
perspective, the biggest register set which must be done is to
update the port mask table in the memory controllers to match the
port mask table from another switch fabric.

8. Aggregator input enables are set for the current switch
capacity. This will start enqueueing traffic on this switch board.
The aggregators will need to see a bus idle followed by an
increment in the transmit sequence number before starting to
actually receive data.

9. SCP sends a queue resync cell. On cell return, fabric queues are
now synchronized. However, no valid data is being enqueued in the
new fabric(s) and the fabric outputs are being ignored.

10. All unstripers must be configured to start utilizing the new
fabric. Since queues have been resynchronized, the fabric dequeuing
should be synchronized and no errors should be seen. If errors are
seen, clear them and return to step 0.
11. After all unstripers have been updated, SCP tells all port card
MCPs to update the stripe amount inside each of the striper ASICs.
The change in striper configuration will start the switch utilizing
the additional capacity.

12. After all stripe amounts are updated and traffic from the
previous stripe amount has drained from the switch, the switch
capacity needs to be updated. The only fixed time bound way to
ensure traffic from the previous stripe amount is flushed is to
execute a queue resync. If not all traffic has been flushed from
the system with the previous stripe amount, the switch will drop
this traffic at the unstripers (since there is no synchronization
of the update at the separators, the drop cannot be performed
there).

Before a port card is brought on-line, any necessary switch fabrics
must be brought on-line first. As per the switch standard
convention, port card installation happens in order.

1. The starting state has sufficient switch capacity to support the
new port card. Aggregators are currently configured to ignore the
input from any new board.

2. Port card is inserted and asserts its board present signal. Port
card sees the sync pattern received from the fabrics.

3. The sync pulse receiver is initialized. The port card starts
looking for a valid sync pulse on the backplane.

4. Striper transmitter is set up for the appropriate number of
destination fabrics and the Gbit network control is initialized.
Before the Gbit networks are initialized, the fabrics cannot count
on seeing idle data from the new port card. At this point, the port
card can communicate its type (OC48/OC192) to the fabrics.

5a. Fabrics configure the port card type and enable the input from
the port card.

5b. Striper/unstriper are now initialized, along with the other
chips on the board. Some enable in the inbound data path should be
disabled. The BIB input enable in the striper can be used, or some
other board-specific input enable.

6. After both 5a and 5b have been completed, the port card can
enable its input side and start sending data to the fabrics. Note
that in general, further software configuration will need to be
done after this point (such as setting up inbound lookup entries).
The completion of 5a is necessary to ensure the fabric queues do
not go out of sync.

7. First data from the port card is striped to all fabrics.

8. When a port card is removed from the system, not very much needs
to happen from a hardware perspective. Before the port card goes
away, it transmits a packet abort which will cause any incomplete
packets in the egress side to be dropped. Traffic will be drained
from the memory queues which correspond to the affected output
ports.

9. To remove a port card from the switch logically, software should
disable the striper output bus.
Fabric deactivation is similar to fabric activation in reverse. The
steps include:

1. Switch capacity is being removed. If port cards are present in
the switch which are paired with the fabric capacity which is about
to be removed, those must first be deactivated.

2. Program the remaining stripers in the system to stripe data to
one less stripe amount than the current configuration. This will
stop sending real data to the fabric about to be decommissioned.

3. Send a queue resynch. This will flush out any traffic at the
last stripe amount.

4. Program the unstripers to start ignoring the data from the
fabric which is about to be removed.

5. The fabric can now be physically removed from the system, or
logically removed from the system by disabling its inputs and
outputs.

The reason for the queue resynch step is not because the switch is
out of sync. The unstriper will treat the receipt of traffic which
is striped to more fabrics than physically present in the switch as
an error and increment error counts. The queue resynch ensures that
the error counts on the unstripers will not increment
unnecessarily.
1. Flush out traffic from the port to be converted over to APS.
Initialize anything in the separator as required for the new output
port combination.

2. Write to the APS enable bit using the shadow register in every
memory controller for the output port being affected. The main port
for APS is not affected. Either a higher or lower number port can
be the primary port and the backup port. APS is always enabled on
the backup port.

3. Send either a queue resync cell or a shadow control cell to all
memory controllers.

4. Memory controllers start to dequeue after the next left-aligned
cache boundary (if the previous transfer for this port was
left-aligned, it will be remembered).

Note that in all this process, the queue number was never switched.
The switch will not support a seamless port swap due to APS
activate/deactivate. (In other words, APS can be turned on for port
0, which will cause port 0 to mirror port 16. However, APS cannot
be turned off on port 16 since it is not on. Traffic is only being
changed for the port where APS is added.)

The following words have reasonably specific meanings in the
vocabulary of the switch. Many are mentioned elsewhere, but this is
an attempt to bring them together in one place with definitions.
TABLE 23:

Word               Meaning

APS                Automatic Protection Switching. A SONET/SDH standard for implementing redundancy on physical links. For the switch, APS is used to also recover from any detected port card failures.

Backplane synch    A generic term referring either to the general process the switch boards use to account for varying transport delays between boards and clock drift, or to the logic which implements the TX/RX functionality required for the switch ASICs to account for varying transport delays and clock drifts.

BIB                The switch input bus. The bus which is used to pass data to the striper(s). See also BOB.

Blade              Another term used for a port card. References to blades should have been eliminated from this document, but some may persist.

BOB                The switch output bus. The output bus from the striper which connects to the egress memory controller. See also BIB.

Egress Routeword   This is the routeword which is supplied to the chip after the unstriper. From an internal chipset perspective, the egress routeword is treated as data. See also fabric routeword.

Fabric Routeword   Routeword used by the fabric to determine the output queue. This routeword is not passed outside the unstriper. A significant portion of this routeword is blown away in the fabrics.

Freeze             Having logic maintain its values during lock-down cycles.

Lock-down          Period of time where the fabric effectively stops performing any work to compensate for clock drift. If the backplane synchronization logic determines that a fabric is 8 clock cycles fast, the fabric will lock down for 8 clocks.

Queue Resynch      A queue resynch is a series of steps executed to ensure that the logical state of all fabric queues for all ports is identical at one logical point in time. Queue resynch is not tied to backplane resynch (including lock-down) in any fashion, except that a lock-down can occur during a queue resynch.

SIB                Striped input bus. A largely obsolete term used to describe the output bus from the striper and input bus to the aggregator.

SOB                One of two meanings. The first is striped output bus, which is the output bus of the fabric and the input bus of the agg. See also SIB. The second meaning is a generic term used to describe engineers who left Marconi to form/work for a start-up after starting the switch design.

Sync               Depends heavily on context. Related terms are queue resynch, lock-down, freeze, and backplane sync.

Wacking            The implicit bit steering which occurs in the OC192 ingress stage since data is bit interleaved among stripers. This bit steering is reversed by the aggregators.



The Aggregator Receive Synchronizer's function is to maintain
logical cell/packet ordering across all fabrics. Cells/packets
arriving at more than one fabric from different port cards need to
be processed in the same logical order across all fabrics. If
cell/packet logical ordering is not maintained, then cells/packets
coming out of the fabrics will have stripes of a particular
cell/packet that do not match up and cannot be re-assembled by the
Unstriper.
Logical cell/packet ordering needs to be maintained across the
following conditions:

- Transport delay variances between one source and multiple
  destinations
- Clock drift across transmitters and receivers
- Insertion and removal of port cards and fabrics
- Port card errors such as no sync, no lock-downs, too fast/too
  slow, routeword parity errors
- Gigabit transceiver errors such as loss-of-lock, data errors
- Non-synchronized updates to the Gigabit network
- OC192c data streams (aggregating 4 channels to make up one
  OC192c stream)

The switch uses a system of transmit and receive counters. The
counters allow all components in the system to logically align
themselves. The Master Sequence Generator implements these two
counters that will count continuously from '0' to [illegible] and
will increment every x 125 MHz clock cycles where x is the counter
tick length as programmed by software. x is currently calculated to
be 250 cycles. This is based on analysis done in the Backplane
Synchronization ADS. The relationship between the transmit and
receive counters can be seen in Figure 26. One counter will be used
by the transmit synchronizers in the Striper and Separator ASICs
and the other counter will be used in the receive synchronizers in
the Aggregator and Unstriper ASICs. The receive counter will be a
delayed version of the transmit counter. The amount of delay is
programmed by software in the Sync Pulse Receive Delay register.
This register determines the number of clock cycles that the
receive counter waits before incrementing its own counter relative
to the transmit counter. This register should always be non-zero
since the transmitter will have no delay and the receiver needs to
be delayed with respect to the transmitter. The Sync Pulse Receive
Delay has been estimated to be 150 cycles. The delay is
approximately equal to the worst case transport delay between
transmitter and receiver plus worst case transport delay variance
of the sync pulse. The delay also takes into account worst case
fast and slow transmitters and receivers.
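
The counter relationship described above can be sketched in a few lines of Python. This is only an illustrative model, not the ASIC logic: the tick length of 250 cycles and the receive delay of 150 cycles are the values quoted in the text, while the function names and the number of counter values (NUM_COUNTS) are assumptions chosen for the example.

```python
TICK_LENGTH = 250   # cycles per counter tick (value quoted in the text)
RX_DELAY = 150      # Sync Pulse Receive Delay in cycles (value quoted in the text)
NUM_COUNTS = 4      # hypothetical number of counter values, for illustration only


def transmit_count(cycle: int) -> int:
    """Transmit sequence counter value at a given 125 MHz clock cycle."""
    return (cycle // TICK_LENGTH) % NUM_COUNTS


def receive_count(cycle: int) -> int:
    """Receive counter modeled as the transmit counter delayed by RX_DELAY cycles."""
    if cycle < RX_DELAY:
        return 0  # the delayed counter has not started advancing yet
    return transmit_count(cycle - RX_DELAY)
```

In this model the receive counter reaches each count exactly RX_DELAY cycles after the transmit counter does, which is the property the receive synchronizers rely on.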

The Sync Pulse Period is defined as the number of cycles between
sync pulses. It is extended slightly by about 10 cycles in order
for it to appear late in the '0' window of each ASIC's sequence
count. This is done to ensure that every ASIC will appear to be
running too fast even if they are actually running slow relative to
the clock that generated the sync pulse. If this was not done, the
sync pulse could appear in the [illegible] window and the ASIC
would consider itself to be slow. There would be no way for it to
catch up. Each transmitter and receiver will calculate the
difference between when the sync pulse arrives and when its own
counter transitions from [illegible] to '0'. This difference is the
number of cycles that it is fast and is referred to as the
lock-down amount (z in figure). Once a transmitter determines it
should lock down for z cycles, it will finish sending valid data
during its '0' window and then lock down z cycles. During the
lock-down period, no valid or idle data is sent. Instead, a special
lock-down K character is transmitted which will be recognized by
the receiver. The receiver will not write the lock-down characters
into its input FIFOs. This will ensure that the input FIFOs can't
overflow. Since the sequence counter does not advance for the
amount of lock-down, it is effectively resetting itself to the sync
pulse. It is equivalent to having the sync pulse appear at the
start of the '0' count window since the transition to a count of
'1' occurs precisely one tick length after the sync pulse arrives.
When the next sync pulse arrives, if clock frequencies are
constant, then the sync pulse should appear in the '0' count window
and the calculated lock-down amount will be the same as the
previous calculation. This allows the system to always expect the
sync pulse arrival in the '0' count window even if the clocks
generating the sequence counter are too fast or too slow.
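
The lock-down calculation can be illustrated with a small Python sketch. It assumes, as a simplification, that each device records the local cycle at which its counter wrapped to '0' and the local cycle at which the sync pulse arrived. The function names and the string used for the lock-down K character are hypothetical.

```python
LOCK_DOWN = "K_LOCKDOWN"  # stand-in for the special lock-down K character


def lock_down_amount(wrap_cycle: int, sync_pulse_cycle: int) -> int:
    """Cycles by which the local counter is fast (z in the text).

    wrap_cycle: local cycle at which the counter wrapped to '0'
    sync_pulse_cycle: local cycle at which the sync pulse arrived
    """
    z = sync_pulse_cycle - wrap_cycle
    if z < 0:
        # The extended sync pulse period is meant to prevent this case.
        raise ValueError("sync pulse arrived before the wrap; counter appears slow")
    return z


def transmit_window(valid_words, z):
    """Send the window's valid words, then z lock-down characters."""
    return list(valid_words) + [LOCK_DOWN] * z


def receiver_push(stream):
    """The receiver never writes lock-down characters into its input FIFO."""
    return [word for word in stream if word != LOCK_DOWN]
```

For example, a device whose counter wrapped at cycle 1000 and saw the sync pulse at cycle 1010 would lock down for 10 cycles, and the receiver's FIFO would see none of those 10 characters.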

The Receive Synchronizer block will use the sequence counter to
determine when to accept data from input byte sync FIFOs. Once a
sync character is read, pops from the FIFOs will only occur once
the sequence counter transitions from "0" to "1" and immediately
following an arrival of a sync pulse. The read decision is only
made once every sync pulse arrival and only at the "0" to "1"
transition of the receive sequence counter. The sequence counter is
also used during fabric resync in order to communicate a fabric
resync to all channels in all aggregators during a sequence count
transition. Fabric resync cells will be transmitted at the
beginning of a sequence tick window and are prefixed by a special
character indicating a resync cell. The receive synchronizers in
the Aggregators will resynchronize all data going to the memory
controllers on the next sequence count transition once the resync
character has been received.

A block diagram of the Receive Synchronizer can be seen in Figure
[illegible]. The Receive Synchronizer consists of 24 Byte sync
FIFOs, a Crossbar and 6 Bus Synchronizers. There is one byte sync
FIFO per gigabit receiver. Each byte sync FIFO will accept data
from each gigabit receiver independent of the mode of the switch.
The byte sync FIFO is about 256 words deep. This depth is based on
a derivation found in the Backplane Synchronizer ADS. The Crossbar
will handle the assignment of the appropriate input byte lanes to
the correct channels. Each Bus Synchronizer will consist of four
Channel FIFOs and one Bus Controller. The Bus Controller can handle
4 separate OC48 channels or one OC192c stream. The channel FIFO is
about 16 words deep. The depth is based on the number of words to
read a 36 bit routeword. The whole routeword is read and then
presented to the rest of the Aggregator in one cycle since it needs
to be stored before the data of the packet as it is constructed and
sent to the memory controller.



Multiple gigabit receivers make up a 24-bit data bus and 2-bit
routeword bus for one channel of an Aggregator. Each gigabit
receiver can handle up to 8 bits. Due to varying transport delays
that can exist between receivers, bytes from different receivers
that belong to the same word can be skewed from each other. For
example, the 24-bit data bus and 2-bit routeword bus for one
channel of an aggregator will have 4 receivers that make up the
bus. The synchronization logic will align all 4 bytes for the
26-bit bus and will pass this byte-aligned word to the rest of the
Aggregator. In order to align the bytes, the Striper will need to
send a special alignment byte to each receiver. A special K
character can be utilized from the gigabit transceivers. The K
character will be encoded on the data bits on the Gigabit
transmitter and will be detected on the Gigabit receiver.

The receive synchronizer in the Aggregator will consist of 24 FIFOs
where there is one FIFO per Gigabit Receiver. These FIFOs will
handle both byte alignment and the backplane synchronization. It is
assumed that the Gigabit Receivers will be able to distinguish
between valid, idle, sync and lock-down cycles and will indicate
these various cycles to the Aggregator by using 3 control signals.
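
The byte alignment idea can be sketched as follows. This Python fragment is a simplified software model under stated assumptions: each lane is represented as a list of received items, the alignment character is modeled by the hypothetical constant SYNC, and each lane is assumed to contain it before its payload.

```python
from collections import deque

SYNC = "SYNC"  # stand-in for the special alignment K character


def align_lanes(lanes):
    """Align skewed byte lanes on their first sync character.

    lanes: list of per-receiver item lists, each containing SYNC before
    its payload. Returns the byte-aligned words as tuples, one item per lane.
    """
    fifos = []
    for lane in lanes:
        start = lane.index(SYNC)            # everything before the sync is discarded
        fifos.append(deque(lane[start + 1:]))
    words = []
    # Pop one item from every lane per word; stop when any lane runs dry.
    while fifos and all(len(f) > 0 for f in fifos):
        words.append(tuple(f.popleft() for f in fifos))
    return words
```

Because every lane starts being queued at its own sync character, items that left the Striper together are popped together, regardless of how much each lane was delayed in transit.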

On startup, the FIFOs will be empty and each Write State Machine
(WSM) will wait until a sync character is seen on its input. From
this point on, every cycle will be pushed except for lock-down
cycles from the fabric. When the fabric is locking down, the
Stripers will send special lock-down characters. This is done to
avoid overflowing the sync FIFOs in case the write side clock is
faster than the read side clock. While particular types of words
are being pushed, the word type will also be written to the FIFO so
it can be distinguished on the read side.
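
A minimal Python model of the write-side behavior just described is shown below. The word type names and the function name are assumptions for illustration. The only behavior taken from the text is that pushing starts at the first sync character, that lock-down cycles are never pushed, and that the word type is stored alongside the data.

```python
from collections import deque


def write_state_machine(cycles):
    """Model of one byte sync FIFO's write side.

    cycles: iterable of (word_type, data) pairs, where word_type is one of
    'valid', 'idle', 'sync' or 'lock_down'. Returns the FIFO contents.
    """
    fifo = deque()
    synced = False
    for word_type, data in cycles:
        if not synced:
            if word_type != "sync":
                continue          # still waiting for the first sync character
            synced = True
        if word_type == "lock_down":
            continue              # lock-down cycles are never pushed
        fifo.append((word_type, data))  # word type is stored with the data
    return fifo
```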

The WSM is also looking for a special fabric resync cell K
character that will indicate that a fabric queue resync cell will
immediately follow. If a resync cell is detected, a resync signal
is passed along to the Bus Controller. The Bus Controller will then
tell other Aggregators on the fabric to resync their queues at the
next transition of the sequence counter. Fabric queue resync is
described in more detail later.

Gigabit receivers are not dedicated to particular input channels,
but instead shared between various channels. Each byte sync FIFO
works independently of the switch mode and each input lane needs to
be steered to the correct channel FIFO. For instance in 40 mode, 26
bits of data and routeword are required for Bus 1, channel A and
therefore 4 byte lanes are required to be steered to each channel
of Bus 1. In 80/120 mode, only 8 bits of data and 2 bits of
routeword are required and therefore two bytes will suffice. In 480
mode, only 4 bits are required per channel and one byte lane will
suffice. As switch capacity increases, fewer and fewer byte lanes
will be required for a particular channel. For all switch modes,
the routeword bits for a particular channel will always come from
the same byte lane. As the byte lanes get reduced from 4 to 1 byte
lanes, there will always be one common byte lane used to carry the
routeword data lines. The crossbar will take in 24 lanes consisting
of 8 bits of data and 3 bits of control along with other control
signals to communicate with the Bus Control logic. It will then
forward all these signals to the appropriate channels. The Crossbar
will also accept control data from the Bus Controller and forward
signals such as read requests and FIFO flush signals to the
appropriate input byte sync FIFOs. Each crossbar mapping between
input byte lanes and channels is bi-directional.
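
The relationship between the bits a channel needs and the byte lanes steered to it is a simple ceiling division, sketched here in Python. The lane width of 8 bits and the function name are assumptions made for the example; the bit counts are passed in as parameters rather than fixed.

```python
import math

LANE_WIDTH_BITS = 8  # assumed data width of one byte lane


def byte_lanes_needed(data_bits: int, routeword_bits: int) -> int:
    """Number of byte lanes needed to carry one channel's data and routeword."""
    return math.ceil((data_bits + routeword_bits) / LANE_WIDTH_BITS)
```

Under these assumptions a channel needing 26 bits in total takes four lanes, one needing 10 bits takes two, and one needing 4 bits takes a single lane.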

The Bus Controller consists of three state machines. The state
machines control the read side of the byte sync FIFOs, the write
side of the channel FIFOs and the read side of the Channel FIFOs.
On the read side of the Byte FIFOs, pops will not commence until a
sync pulse has arrived and the receive sequence counter has
transitioned from "0" to "1". A signal will be provided from the
sequence generator block that indicates a "0" to "1" transition at
precisely this moment (sync_event). At this time, the Bus
Controller issues a read to the Crossbar for the particular
channel. The Crossbar then forwards the read signal to the
appropriate byte sync FIFOs based on the mode of the switch. The
Crossbar then forwards all data and control from these byte sync
FIFOs back to the Bus Controller for this channel. The Bus
Controller checks the data types to make sure that the first word
in each of the appropriate byte sync FIFOs is a sync character. If
the first word of any of the appropriate byte lanes for this
channel is not a sync character, then a sync error will be flagged,
appropriate byte sync FIFOs will be flushed and the synchronization
process will be re-initiated. If the first word is a sync
character, then pops will continue. In OC48 mode, this process will
be performed independently for each channel. OC192c support is
discussed later on.

Once data starts being read from byte sync FIFOs, the Bus
Controller will ignore data until it finds the first idle word.
Once an idle word has been found, it can now start looking for the
SOP indication in the routeword when the next non-idle word is
read. The rest of the routeword is processed and made available to
the rest of the Aggregator. If the stop bit in the routeword
indicates that the packet is continuing, then data will be
continuously made available to the Aggregator until a stop
indication is read. Note that even though a SOP is seen, it does
not mean that this segment is the first segment of a packet. It can
be any segment of a packet. Even though the segment may not be the
first one of a packet, it is allowed to go through the switch and
will be dropped later on.

When a sync character is read, a counter is initialized. The
counter counts each read from the byte sync FIFOs. The Bus
Controller will expect to see a sync character every sync pulse
period (about 22,000 cycles). If a sync character is read too early
or too late, then a sync error is flagged and data is dropped at
the precise logical cycle where a sync character is expected. A
packet that is being processed at the theoretical logical cycle for
sync will be terminated and inputs will be disabled until
re-enabled by S/W. For example, if after the first sync character,
the next sync character occurs at cycle 19,000, then a sync error
is flagged. Data is not dropped until 22,000 reads have been
performed. Also, if after the first sync character, the next sync
character is not received at all after 22,000 cycles, then a sync
error is flagged and data is dropped at this precise logical cycle.
If a sync character is received precisely 22,000 cycles after the
last one, then reads from the byte sync FIFOs are stopped until the
receive sequence counter transitions from '0' to '1'. Waiting for
the '0' to '1' transition will ensure that all fabrics are
receiving the same stripe of a packet on the same logical cycle.
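
The read counter check described above can be sketched in Python. This is a simplified model: the period of 22,000 reads is the approximate figure given in the text, and the function name, return values and word type strings are assumptions for illustration.

```python
SYNC_PERIOD = 22_000  # reads between sync characters (approximate figure from the text)


def check_sync(read_types, period=SYNC_PERIOD):
    """Classify the arrival of the next sync character.

    read_types: word types read after the previous sync character.
    Returns (status, drop_cycle): status is 'ok', 'early', 'late' or
    'pending'; drop_cycle is the read count at which data is dropped.
    """
    for count, word_type in enumerate(read_types, start=1):
        if word_type == "sync":
            if count < period:
                # Error is flagged now, but the drop happens at the expected cycle.
                return ("early", period)
            return ("ok", None)
        if count >= period:
            return ("late", period)   # sync missing at the expected cycle
    return ("pending", None)          # not enough reads yet to decide
```

Returning the expected cycle as the drop point in both error cases reflects the requirement that every fabric drops data at the same logical cycle.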



For OC192c, 4 input channels need to be concatenated into one
OC192c stream. In this mode, the Bus Controller will control all 4
channel FIFOs and the appropriate byte sync FIFOs. Data type
checking will be performed across 4 times as many byte lanes as in
the OC48 case. When it is time to read byte sync FIFOs, the Bus
Controller will control 4 read control lines to the Crossbar. The
Crossbar will initiate reads across all appropriate byte sync FIFOs
that are required for OC192c and will present data back to the Bus
Controller. The Bus Controller will check data types and will look
for SOP indications. The SOP indication and stop bits will only be
found in the Routeword for channel A. The Bus Controller will write
all 4 channel FIFOs at the same time when writing data and will
present the complete OC192c Routeword in one cycle to the rest of
the Aggregator. The functions of the Bus Controller will be
identical for OC48 and OC192c except that all 4 channel FIFOs will
be controlled when in OC192c mode.

Special cases can be broken down into the following categories:

- Port card insertion

- Port card removal

- Port card errors including:

    - No sync character

    - Port card not locking down

    - Routeword parity errors

    - Garbage data

    - Port card sending data too fast or too slow

- Fabric Queue resync

- Non-synchronized updates to Gigabit network

When a port card is inserted, the port card present signal will be
asserted and sent to each fabric. Not until S/W enables the
particular inputs and the Aggregator sees the port card present
signal, will the Aggregator be ready to accept data from the new
port card. Once enabled, the Aggregator will go through the process
of looking for sync characters on individual byte lanes associated
with the new port card. It is assumed that the port card will not
send any data until it has been configured only after the fabrics
have been initialized. Once the port cards are enabled, they will
start sending sync characters periodically at every global sync
pulse arrival. It is important that all the appropriate fabrics see
the sync character from the particular port card since some fabrics
will be initialized later than others. After sync characters have
been received, all data will be written on each cycle excluding
lock-down characters.

When a port card is about to be removed, the enable switch on the
port card will be turned off. This will signal the port card to
finish sending valid packets and then send idles. The port card
will send a packet abort K character to indicate that no more valid
packets will be sent immediately following the last valid packet.
It is assumed that when the port card is actually removed, it will
have already sent the packet abort K character. This is critical
for the fabrics to keep their queues in sync. It is important that
each Aggregator on each fabric that handles the particular port
card stops forwarding data to the memory controllers at precisely
the same logical cycle. The WSM will stop writing data into the
byte sync FIFOs once the packet abort character is seen. The Bus
Controller will terminate the packet once the packet abort
character is read out of the byte sync FIFOs.

Case A: No sync/early sync/late sync from port card.

Solution: The Synchronizer will look for a sync at precisely the
same logical cycle each time. This will occur every sync pulse
period that is approximately 22,000 125 MHz cycles. If the sync
character is not present at the head of the byte sync FIFOs when
22,000 cycles have been read since the last sync character, a sync
error will be flagged and data will be dropped at the cycle where
the sync character should have been. All fabrics need to drop data
at precisely the same logical cycle for this particular input lane.
Inputs for this particular channel will be turned off and the byte
sync FIFOs used for this channel will be flushed. S/W will turn off
the offending Striper. Inputs will be ignored until S/W enables
these inputs again. If a sync character arrives too early, then
data should be dropped at precisely the cycle where the early sync
was read. Other Aggregators will make the same drop decision if
this error is common to all fabrics. If the sync character arrives
too late or not at all, then the drop decision will be made where
the sync character was expected. The sync character is expected to
arrive every 22,000 cycles after the last sync.

Case B: Port card not locking down.

Solution: If the port card does not lock down, it will then send
more than the ideal number of valid and idle cycles between sync
characters. This will be caught by the same logic that checks for
sync characters in the correct logical cycles. Data will be dropped
the same way as in the case where no sync came from the port card.

Case C: Routeword parity errors.

Solution: If a parity error is detected for a particular routeword,
the packet will be terminated at the bad segment and a parity error
will be flagged. Data will be dropped after this terminated segment
is forwarded to the rest of the Aggregator and FIFOs for this
particular channel will be flushed. Inputs will be disabled until
re-enabled by S/W.

Case D: Garbage data from port card while all fabrics already in
sync.

Solution: If the data is unrecognizable by the gigabit receivers,
errors will be formed and provided to the Aggregator by the gigabit
receivers. At the point of error, data being written into byte sync
FIFOs will be flagged to be in error. If the Bus Controller sees
that the particular byte lane in error is not used for the
Routeword bits, then the error will be flagged but the data will be
passed on to downstream logic. This is considered to be a soft
failure since queues will still be able to stay in sync. If the Bus
Controller sees that the particular byte lane in error is used for
the Routeword bits, then the packet will be terminated and then
dropped once the erred word is read from the byte sync FIFO. The
input will be disabled, a gigabit receiver error will be flagged to
S/W and byte sync and channel FIFOs associated with this channel
will be flushed. This is considered to be a hard failure. If the
failure occurs only for one fabric, then other fabrics can still be
used to reassemble the packets. S/W will have to queue resync the
bad fabric. If this error occurs across multiple fabrics, not much
can be done to avoid fabric queues from becoming corrupted. S/W
will then have to queue resync all fabrics.
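
The soft/hard distinction for gigabit receiver errors reduces to whether the failing byte lane carries the routeword bits. A small Python sketch of that decision follows. The lane indices, the function name and the action strings are hypothetical and serve only to restate the rule.

```python
def classify_receiver_error(lane_in_error: int, routeword_lane: int):
    """Return (severity, actions) for an error on one byte lane of a channel."""
    if lane_in_error != routeword_lane:
        # Queues can stay in sync, so the data is still passed downstream.
        return ("soft", ["flag error", "pass data on"])
    # The routeword cannot be trusted, so the packet cannot be handled.
    return ("hard", ["terminate and drop packet", "disable input",
                     "flag receiver error to S/W", "flush FIFOs"])
```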

Case E: Port card sending data too fast or too slow. It is possible
that the port card is sending the correct number of valid cycles
between sync characters but is not locking down enough or locking
down too much during each lock-down period. Byte sync FIFOs can
eventually overflow or underflow respectively. If more than one
fabric has FIFOs that overflow or underflow and data is dropped at
different logical cycles for the same source, then fabric queues
can become out of sync.

Solution: This is considered a hard failure since it should not
occur if the hardware is working correctly. The only way to
possibly prevent this is to flag an error if the FIFOs reach an
almost full or almost empty threshold. This is a warning sign that
something is wrong. S/W will then turn off the offending port card.
Data will continue to be written to and read from the byte sync
FIFOs as if nothing is wrong. If the port card can be turned off
and idles be sent before byte sync FIFOs overflow, then there will
be no dropped data and fabric queues will stay in sync. If FIFOs
overflow or underflow for a particular channel, then a FIFO
overflow/underflow error will be flagged. The packet being
processed by the synchronizer at the time of error will be
terminated. All data will be dropped from this point on. Inputs for
this channel will be disabled until re-enabled by S/W. FIFOs for
this channel will be flushed.

Fabric queue resync is performed in order to resynchronize memory
controller queues. It is important that all fabrics are processing
the stripe of the same cell or packet at precisely the same logical
cycle and that all fabrics are acting together as one logical
fabric. Fabric queue resync starts at the Stripers. The Striper
will receive a queue resync cell from the control port. The striper
will decode the queue resync cell and will back up traffic until
the next sequence counter tick is reached. At this point, it will
send a fabric queue resync K character immediately followed by the
queue resync cell. At the fabric, the WSM in the receive
synchronizer will receive the queue resync K character and notify
the Bus Controller in the receive synchronizer that a queue resync
cell is in the input FIFO and that the queue resync event should
occur at the next transition of the receive sequence counter. The
Bus Controller will then indicate to other Aggregators on the
fabric that a resync cell event will take place at the next
transition of the sequence counter. The indication is asserted
about 10 cycles before the receive sequence counter transitions.
This is done to allow enough time for other Aggregators to see this
assertion before their respective receive sequence counters
transition also. Once the sequence count transition occurs, the
Aggregators will signal to the memory controllers that a queue
resync event has occurred and that this event delimits old and new
data. All data sent before the sync event is considered old data
and all data sent after the sync event is considered new data. The
memory controllers synchronize their buffers accordingly. The
resync cell is eventually sent through the switch as a regular cell
and returned to the control port.

There can be times when the gigabit network is changing its
operating mode and the switch is changing from a 40/80 to an 80/120
mode for example. There is no guarantee that Gigabit Receivers will
be driven by Gigabit Transmitters during this time period.
Aggregators that expect good data from certain Gigabit Receivers
may not get good data. If the switch is increasing its mode, then a
previously unused FIFO will now be used. If this FIFO has garbage
data on its inputs, then syncs will not be received and this FIFO
will not be synced until the gigabit network is stable. Once the
Gigabit network is stable, idles and sync characters will be
transmitted by the port cards and the FIFOs will have enough time
to sync up. If the switch is decreasing its mode, then previously
used FIFOs will now be unused. The Aggregator will know the new
switch capacity and will eventually ignore these channel FIFOs.

The Unstriper needs to provide back-pressure to the Separators when
internal FIFOs in the Unstriper become near full. Each Separator
will expect 24 separate back-pressure signals coming from all the
port card channels it is connected to. The back-pressure signal is
considered to be asynchronous to all ASICs. It is required that all
relevant Separators receive back-pressure from a particular channel
in the Unstriper at precisely the same logical cycle. This is done
by having the Unstripers assert the back-pressure signal when their
receive sequence counter transitions. It is assumed that the
Unstriper's receive sequence counter is a delayed version of the
Striper's transmit sequence counter. Since the tick length is 250
cycles and the receive counter is delayed by 150 cycles relative to
the transmit counter, there exist 100 cycles of margin to transport
the back-pressure signal from the Unstriper to the Separator. The
Separator needs about 10 cycles before the transition of its
sequence counter to sample the back-pressure signal. This will give
the Separator enough time to provide back-pressure to the memory
controller before the counter transitions. This places a maximum
requirement on the propagation delay of the back-pressure signal.
The following requirements hold true:

Back-pressure propagation delay < counter tick length - receive
sync pulse delay - setup time of Separator's sample point

Back-pressure propagation delay < 250 - 150 - 10

Back-pressure propagation delay < 90 cycles @ 125 MHz or 720 ns

Assuming worst case conditions, the expected worst case propagation
delay would be:

Back-pressure propagation delay = (Unstriper to Striper delay) +
(Striper to Aggregator delay) + Aggregator to Separator delay

Back-pressure propagation delay = 5 cycles (chip and board delay)
+ (5 + 62 cycles) (chip and port card to fabric delay of 500 ns)
+ 5 cycles (chip and board delay)

Back-pressure propagation delay = 77 cycles < 90 cycles

As can be seen from this estimate, the maximum back-pressure
propagation delay requirement is met.
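
The timing budget can be checked numerically with a short Python sketch. The tick length (250), receive delay (150), setup time (about 10 cycles) and the 500 ns port card to fabric delay are taken from the text. The 5-cycle chip and board delays are a reading of the partly illegible source and should be treated as assumptions, as should the function names.

```python
CYCLE_NS = 8  # one 125 MHz clock cycle lasts 8 ns


def back_pressure_budget(tick_length=250, rx_delay=150, setup=10):
    """Maximum allowed back-pressure propagation delay, in cycles."""
    return tick_length - rx_delay - setup


def worst_case_delay(chip_board=5, port_to_fabric_ns=500):
    """Estimated worst-case delay, in cycles, along the path
    Unstriper -> Striper -> Aggregator -> Separator."""
    unstriper_to_striper = chip_board
    striper_to_aggregator = chip_board + port_to_fabric_ns // CYCLE_NS
    aggregator_to_separator = chip_board
    return unstriper_to_striper + striper_to_aggregator + aggregator_to_separator


budget = back_pressure_budget()
estimate = worst_case_delay()
print(budget, budget * CYCLE_NS, estimate, estimate < budget)
```

With these inputs the budget is 90 cycles (720 ns) and the estimate is 77 cycles, which leaves 13 cycles of slack.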

Assuming all the relevant Separators receive the back-pressure
signal before the transition to the next sequence count, then it
can be synchronized to the next transition of the transmit sequence
counter. This will allow all relevant Separators to stop sending
valid data at precisely the same logical cycle for one complete
counter tick interval. This is true since it is assumed that when
the transmit sequence counter transitions, the data that the
Separators are sending are companion fragments of the same packet.
If back-pressure is sampled again before the next counter
transition, then data will be stopped for another counter tick
interval. This mechanism implies that back-pressure can only be
generated on a counter tick length granularity.

Since — there — rs — no — dir e ct — path — from — Unstriper — to 

Separator, — the back pressure signals need to be re - routed from the 
Unstriper, — to — the — Striper, — to — the — Aggregator — arrd — finally — to — the 
Separator . In order — to do this, — each Unstriper needs to send the 

2 0 back - pressure — signal — to — the — corresponding — Striper — on — that — port 

card. Wte — Striper — will — then — forward — the — back pressure — signal 

through — t+re — backplane — gigabit — transceivers — onto — the Aggregator. 
The Aggregator will forward up to 2-4 separate back pressur e signals 
to one Separator corresponding to G buses with 4 channels per bus. 

The back-pressure signal will always use bit 0 of the gigabit
transceivers. The receive synchronizer block in the Aggregator
will forward the correct back-pressure signal for the appropriate
bus and channel to the Separator. Since the gigabit receivers are
not dedicated to any particular bus and channel, the synchronizer
needs to select the correct gigabit receiver based on the switch
configuration, just like it does for regular data. Once this is
done, bit 0 of the gigabit receiver is forwarded on as the
back-pressure signal. Note that bit 0 is also used for receiving k
characters and can change when sending a k character. In order to
avoid mistakenly interpreting bit 0 of a k character as a valid
back-pressure signal, the synchronizer will only sample the
back-pressure bit when valid data is received from the gigabit
receiver. In the case where a k character is received, the
synchronizer will hold the back-pressure signal at its current
value. There is still a case where the Striper can be sending
back-to-back idle characters since there is nothing to send. If
the Striper needs to change the value of the back-pressure signal
in this case, then it will send one of two k characters that change
the back-pressure value. The two k characters that will be used
are a set and a clear of the back-pressure signal. If the
synchronizer receives a back-pressure set or clear character, it
will set or clear the back-pressure signal respectively. If any
other k character is received, the current back-pressure signal is
retained. If valid data is received, bit 0 of the appropriate
gigabit receiver is sampled as the back-pressure signal.
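
The sampling rules just described can be summarized as a small
state machine. The sketch below is illustrative only and is not the
actual synchronizer logic: the symbol classification (Symbol), the
class name BackPressureSync and the idea of passing the received
word as an integer are assumptions introduced for the example.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Symbol(Enum):
    """Hypothetical classification of what a gigabit receiver delivers."""
    DATA = auto()        # valid data word; bit 0 carries back pressure
    K_BP_SET = auto()    # k character that sets back pressure
    K_BP_CLEAR = auto()  # k character that clears back pressure
    K_OTHER = auto()     # any other k character (for example, idle)


@dataclass
class BackPressureSync:
    """Tracks the back-pressure signal for one selected receiver."""
    back_pressure: bool = False

    def receive(self, kind: Symbol, word: int = 0) -> bool:
        if kind is Symbol.DATA:
            # Valid data: sample bit 0 as the back-pressure signal.
            self.back_pressure = bool(word & 1)
        elif kind is Symbol.K_BP_SET:
            self.back_pressure = True
        elif kind is Symbol.K_BP_CLEAR:
            self.back_pressure = False
        # Any other k character: bit 0 is ignored and the value is held.
        return self.back_pressure
```

In this sketch an idle k character whose bit 0 happens to be set
leaves the signal unchanged, which is the misinterpretation the
described sampling rule is meant to avoid.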

Although the invention has been described in detail in
the foregoing embodiments for the purpose of illustration, it is to
be understood that such detail is solely for that purpose and that 
variations can be made therein by those skilled in the art without 
departing from the spirit and scope of the invention except as it 
may be described by the following claims. 



WHAT IS CLAIMED IS: 



1. A switch for switching packets, each packet having a 
length, comprising:

a port card which receives packets from and sends packets 
to a network; and 

fabrics connected to the port card which switch the 
packets, each fabric having a memory mechanism, each fabric having 
a mechanism for determining the length of each packet received by 
the fabric and placing a length indicator with the packet so when 
the packet is stored in the memory mechanism, the determining 
mechanism can identify from the length indicator how long the 
packet is and where the packet ends in the memory mechanism. 

2. A switch as described in Claim 1 wherein the 
determining mechanism includes an aggregator which receives packet 
fragments from the port card, determines the packet length and 
appends packet length information to the beginning of the packet in 
the length indicator. 

3. A switch as described in Claim 2 wherein the memory 
mechanism includes a memory controller, the aggregator sending the 
packet with the packet length information to the memory controller 
which stores the packet with the packet length information. 

4. A switch as described in Claim 3 wherein the memory
controller has a memory which has a wide cache buffer structure in 
which multiple packets are put into one word. 



5. A switch as described in Claim 4 wherein the fabric 
includes a separator which reads the packets from the memory 
controller and extracts the packet length information from each 
packet to determine when each packet ends, and sends fragments of 
the packet to the port card. 

6. A switch as described in Claim 5 wherein the 
separator removes the packet length information from each packet 
before sending any fragments of each packet to an unstriper of the 
port card. 

7. A method for switching packets having a length 
comprising the steps of: 

receiving a packet at a port card of a switch; 

sending fragments of the packet to fabrics of the switch; 

receiving the fragments of the packet at the fabrics of 
the switch; 

measuring the length of the packet at each fabric from
the fragments of the packet received at each fabric;

appending a length indicator to the packet;

storing the packet with the length indicator in a memory
mechanism of the fabric;

reading the packet from the memory mechanism; and

determining where the packet ends from the length
indicator of the packet.

8. A method as described in Claim 7 wherein the
receiving step includes the step of receiving the fragment at an 
aggregator of the fabric. 

9. A method as described in Claim 8 wherein the 
measuring step includes the step of measuring the length of the 
packet with the aggregator. 

10. A method as described in Claim 9 wherein the 
appending step includes the step of appending the length
indicator to the packet with the aggregator. 

11. A method as described in Claim 10 wherein the 
storing step includes the step of storing the packet with the 
length indicator in a memory controller of the memory mechanism. 

12. A method as described in Claim 11 wherein the 
reading step includes the step of reading the packet from the 
memory controller with a separator of the fabric. 

13. A method as described in Claim 12 wherein the 
determining step includes the step of determining where a packet 
ends from the length indicator with the separator. 

14. A method as described in Claim 13 including after
the determining step, there is the step of removing the packet
length information from the separator.

15. A method as described in Claim 14 including after
the removing step, there is the step of sending fragments of the 
packets from the separator to the port card. 

16. A method as described in Claim 15 wherein the 
sending fragments step includes the step of sending fragments of 
the packet to the port card in a same logical time with 
corresponding fragments from other fabrics to the port card. 

17. A method as described in Claim 16 wherein the 
storing step includes the step of storing the fragments of the 
packet in a memory of the memory controller which has a wide cache 
buffer structure in which multiple packets are put into one word. 

18. A method as described in Claim 17 including after 
the reading step, there is the step of extracting the packet length 
information from the packet with a separator. 

19. A method as described in Claim 18 wherein the
receiving step includes the step of receiving the fragments of the 
packet from the fabrics with an unstriper of the port card. 

20. A method as described in Claim 19 wherein the 
sending fragments to the fabric step includes the step of sending 
with a striper of the port card to the aggregator of each fabric 
the fragments of the packet. 

21. A method as described in Claim 20 wherein the step 
of sending fragments to the port card includes the step of sending 
fragments from the separator to an unstriper of the port card. 



ABSTRACT OF THE DISCLOSURE 



TRANSFERRING AND QUEUEING LENGTH AND DATA AS ONE STREAM 

A switch for switching packets. Each packet has a 
length. The switch includes a port card which receives packets 
from and sends packets to a network. The switch includes fabrics
connected to the port card which switch the packets. Each fabric 
has a memory mechanism. Each fabric has a mechanism for 
determining the length of each packet received by the fabric and 
placing a length indicator with the packet so when the packet is 
stored in the memory mechanism, the determining mechanism can 
identify from the length indicator how long the packet is and where 
the packet ends in the memory. A method for switching packets 
having a length. The method includes the steps of receiving a 
packet at a port card of a switch. Then there is the step of 
sending fragments of the packet to fabrics of the switch. Next 
there is the step of receiving the fragments of the packet at the 
fabrics of the switch. Then there is the step of measuring the 
length of the packet at each fabric from the fragments of the 
packet received at each fabric. Next there is the step of 
appending a length indicator to the packet. Then there is the step 
of storing the packet with the length indicator in a memory 
mechanism of the fabric. Next there is the step of reading the 
packet from the memory mechanism. Then there is the step of 
determining where the packet ends from the length indicator of the 
packet.
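
The length-indicator scheme summarized above can be illustrated
with a short sketch. It is a minimal example under stated
assumptions, not the disclosed implementation: the 16-byte word
width, the 2-byte big-endian length field, the zero padding of the
last word, the restriction to non-empty packets, and the function
names pack_words and separate are all hypothetical choices made for
the illustration.

```python
import struct

WORD_BYTES = 16  # assumed width of one wide memory word
LEN_BYTES = 2    # assumed size of the length indicator


def pack_words(packets):
    """Prefix each packet with its length and pack into wide words."""
    stream = b"".join(struct.pack(">H", len(p)) + p for p in packets)
    # Zero-pad so the stream fills a whole number of words.
    stream += b"\x00" * (-len(stream) % WORD_BYTES)
    return [stream[i:i + WORD_BYTES]
            for i in range(0, len(stream), WORD_BYTES)]


def separate(words):
    """Recover packet boundaries using only the length indicators."""
    stream = b"".join(words)
    packets, pos = [], 0
    while pos + LEN_BYTES <= len(stream):
        (length,) = struct.unpack_from(">H", stream, pos)
        if length == 0:  # padding reached (packets assumed non-empty)
            break
        pos += LEN_BYTES
        packets.append(stream[pos:pos + length])  # indicator dropped
        pos += length
    return packets
```

Because the boundary information travels in the same stream as the
data, several packets can share one word and a packet can span
words, and no separate length queue has to be kept in step with the
data queue.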



